Abstract

In this work we study the validity of the so-called curse of dimensionality for
indexing of databases for similarity search. We perform an asymptotic analysis,
with a test model based on a sequence of metric spaces $(\Omega_d)$ from which we pick
datasets $X_d$ in an i.i.d. fashion. We call the subscript $d$ the dimension of the
space $\Omega_d$ (e.g. for $\mathbb{R}^d$ the dimension is just the usual one) and we allow the size
of the dataset $n = n_d$ to be such that $d$ is superlogarithmic but subpolynomial
in $n$.

We study the asymptotic performance of pivot-based indexing schemes where 
the number of pivots is $o(n/d)$. We pick the relatively simple cost model of
similarity search where we count each distance calculation as a single computation
and disregard the rest. 

We demonstrate that if the spaces $\Omega_d$ exhibit the (fairly common) concentration
of measure phenomenon, the performance of similarity search using such
indexes is asymptotically linear in $n$. That is, for large enough $d$ the difference
between using such an index and performing a search without an index at all is
negligible. Thus we confirm the curse of dimensionality in this setting.

Introduction



One often hears the complaint that we live in a data rich but information poor 
society. Indeed, enabled by advances in IT hardware all sorts of data are being 
gathered at astonishing rates. The force behind this is the assumption that 
useful information is to be found in the heaps of data. However slim the ratio 
of useful to useless data may be, the exercise is considered worthwhile. 

As this is almost inevitably complicated, a new area of active research and 
a business sector were created - those of "data mining". We will take a look at
what is perhaps the most fundamental data mining problem of all: given a new 
piece of data, finding similar pieces of data in the pile you have accumulated 
already. This is the problem of similarity search. It should not be confused 
with exact search, the topic usually discussed in introductory computer science 
textbooks. 

The difference is that of finding a book in a library 20 years ago and today. 

In the past, knowing that you wanted to find a book having "Pantagruel" in 
its title meant somehow finding out, by perhaps talking to librarians (who one 
would assume read many books) that the author is one F. Rabelais. Then, you 
would walk down the "fiction" section of authors starting with R, and hopefully 
within a few minutes locate the title. Today, sitting at home and without relying 
on knowledgeable librarians you would start reading the first page of the book 
within seconds of searching for "Pantagruel" in a Web browser. This miracle is 
due in no small part to clever solutions to the similarity search problem called 
indexes. Indexes are structures that organize a database in such a way that fast 
similarity search is possible. 

Is the similarity search problem solved then? Let us consider a slightly 
different problem: we are given a photograph of a mountain landscape with 
no clear giveaways as to the location and yet we would like to know where it 
was taken. A person who knows those mountains will immediately tell you the 
approximate location, as she is able to identify the particular vegetation, rock 
formations and glaciers. But finding such a person is considerably tougher than 
going to your local library. It would be most helpful if one could submit this 
photograph to a search engine containing millions of tagged pictures (the Web?) 
to find the most similar ones. This way our untagged picture will obtain a tag: 
namely geographic information. That no such solution exists (yet) is a testament 
to the difficulty of search in high dimensions. Untagged pictures are composed 
of thousands if not millions of coloured pixels and it is not immediately obvious
how to teach a computer to quickly find similar ones.

This has come to be known as the "curse of dimensionality" and is the
primary topic of this work. We aim to provide an asymptotic analysis of a class
of indexes applied to high dimensional datasets of "typical" behaviour. The
broad conclusion is that high dimension leads almost inevitably to unacceptably
slow performance of search, akin to waiting 3 days for the web browser to tell
you about Gargantua and Pantagruel (enough to make the local library suddenly
competitive).

In Chapter 1, we introduce the formal mathematical setting of this search
problem by defining a space of all queries and datapoints. A distinction is made
between the database, a finite collection of objects and the potentially infinite
larger space that acts as the source of "new" objects - the centres of queries.
Similarity queries are based on objects from the larger space but all that is
known a priori is the database which must serve to infer patterns in the larger
space. We provide a cost model for pivot-based indexes and set up the relationships
between all the relevant variables for asymptotic analysis: primarily the
database size, index size and dimension.

In Chapter 2, a related phenomenon from asymptotic geometric analysis is
presented: the concentration of measure. It is the somewhat counterintuitive
observation that high-dimensional objects appear to be small when we base our
measurements on samples. For example, when a sphere is sampled the so-called
observable diameter as a function of dimension tends to zero very quickly.

We present various families of spaces, which we call Lévy families, that
exhibit the more particular normal concentration of measure. These families are
used as examples of spaces that are interesting to index yet exhibit geometric
properties that make indexing hard. We show how these geometric properties 
of the larger spaces imply something reminiscent of the curse of dimensionality 
for pivot-based indexes. 

In Chapter 3, we introduce Statistical Learning theory as a tool to connect
concentration of measure on spaces to finite datasets, which are treated as
random samples from these spaces. Our interest is in a generalization of the
theorem of Glivenko-Cantelli, due to Vapnik and Chervonenkis, about uniform
convergence of sampled quantities to their true values. The crucial condition
for the applicability of this theorem is that the class of balls in a space have a
low "capacity": among the different capacity measures that can be used here is
the VC dimension. If this condition is met we can make conclusions about the
behaviour of any random dataset.

The main theorem is presented in Chapter 4: that concentration of measure
leads, under certain very natural conditions, to the curse of dimensionality for
pivot indexes. That is, within our model we give a mathematically rigorous
proof that asymptotically in dimension, all pivot-based indexes are not significantly
better than a simple sequential scan of the database. We derive certain
properties of the speed with which this degradation in performance takes place
as well. The asymptotic bound is strong: the performance degrades quickly
in dimension to the point that at least half of all possible queries will almost
surely require a full scan of the database no matter which pivot-based index is used
and with which parameters (under some reasonable limits on space). Care is
taken to present the full set of assumptions of this asymptotic analysis, as its
conclusion may not hold in other cases, for example when the index is allowed
a large amount of storage for pre-computation.

Although the very results we prove demonstrate that indexing in high dimensions
is often hard, in Chapter 5 we perform several experiments with pivot-based
indexes on two different kinds of datasets. The broad conclusion is that no
particular flavour of index does much better than the rest: performance quickly
diminishes, which would leave some tough choices for database designers who
deal with high-dimensional data. Perhaps a pertinent question is whether real-life
datasets are ever truly high dimensional: it seems to be the opinion of some
researchers that they almost never are. In this case the doom and gloom we
prognosticate is mainly theoretical.

Chapter 1



Indexing of metric space databases for similarity search



A database is a collection of records with an added structure that allows the 
user to query, update, and delete records in a variety of ways. Databases are 
extremely pervasive in modern life: the contacts list on a phone, a bank's client 
records, the whole Internet, a grocery store's transactions... In fact this general
definition of a database is not restricted to computerized systems: the ancient 
library of Alexandria was a database as well. 

What computerized systems have allowed is a veritable explosion in the size
and number of databases. A parallel for processors is the famous Moore's law
that hypothesized an exponential growth in the number of components on chips.
The original prediction shown in Figure 1.1 has been valid, up to a constant,
for over 40 years. Although less well documented, it has been an accepted
truth in the business community that the size of databases has been expanding
exponentially as well: e.g. [Kim] p. 295 and [Heg01].

What seems to be happening is a sort of Moore's Law for the size of databases 
as they keep pace with the rise in processing power. 

At stake is the scalability of database systems, as superlinear algorithms 
for querying, updating and deleting are necessarily experiencing a continuous 
degradation of their performance. Thus finding more efficient algorithms is not
becoming less important with rising computing speed as is sometimes suggested 
by non-specialists. Put another way, our expectations of increase in performance 
exceed even the astonishing pace of Moore's Law and so more clever algorithms 
need to be devised. 

To illustrate the main topic of this work, we will briefly summarize what
searching a database typically means in real life. The most common databases
today consist of several dozen (or more) big tables. An example of a 6-column
database table is provided in (pardon the overloading of meaning) Table 1.1. The
two example rows are records, and typically millions of them would exist in a
large company or government department. Customer ID is in database parlance
a primary key: it is used by the system to uniquely identify the record.

Figure 1.1: The original Moore's Law. $\log_2$ of components per chip vs. year

Customer ID   last name   first name   address              postal code   add date
12783         Black       Conrad       50 Beechwood Dr.     M1F6H0        2008-05-01
25456         Minkowski   Alice        204-190 5th Avenue   H8F1K4        2006-09-25

Table 1.1: a typical database table

Perhaps the most well known function of a database is a search of its records. 
An example could be the retrieval of all customers who were entered into the 
database in May 2007. While it is possible to execute this request by looking 
sequentially at each record and seeing if the "add date" attribute matches, a 
more efficient solution is possible. 

Roughly speaking, in a database with n records the sequential scan method 
proposes to look at n records while the more efficient one will only do about 
log(n) lookups. One such method is the B-tree ( [Sed98] ), a structure that puts 
the values of the attribute to be searched in a tree of depth log(n) and hence 
exact search takes only about that many operations. In fact the time to perform 
search can be thought of as constant: even for n = 10^ a search query requires 
only 2 comparisons. [Sed98] 
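
The gap between a sequential scan and a logarithmic lookup can be illustrated with a minimal sketch. It uses plain binary search over a sorted list rather than an actual B-tree, so it is only a stand-in for the idea and not the structure described in [Sed98]; the function names and the probe counters are assumptions made for this illustration.

```python
def linear_lookup(sorted_keys, target):
    """Scan the keys one by one; return (found, number of records examined)."""
    examined = 0
    for key in sorted_keys:
        examined += 1
        if key == target:
            return True, examined
    return False, examined


def binary_lookup(sorted_keys, target):
    """Halve the search interval at each step; return (found, number of probes)."""
    lo, hi = 0, len(sorted_keys) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == target:
            return True, probes
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, probes


keys = list(range(0, 2_000_000, 2))      # one million sorted (even) keys
print(linear_lookup(keys, 1_999_998))    # has to examine every record
print(binary_lookup(keys, 1_999_998))    # needs on the order of log2(n) probes
```

A real B-tree has a much larger branching factor than 2, which is why its depth, and hence the number of lookups, is smaller still.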

This "classical" search problem is actually a very particular case of similarity
search. To understand why other kinds of search are interesting, it is helpful to 
realize that a B-tree or similar structures only deal with one numeric attribute 
at a time. A structure is built on that attribute alone, which makes for efficient 
searches of a very particular kind: those based on that single attribute. Multiple 
B-trees would have to be built if search on other attributes is to be expected, 
and things get more complicated still with complex queries involving multiple 
attributes at the same time.

When more dimensions are added to the mix, the problem not only gets
more difficult, but no efficient solutions seem to exist. As multiple attributes at
once are to be taken into account it is no longer possible to order all the records 
in a line. We can only say, given two records, how close they are to each other. 

A database of fingerprints, stored as an array of black and white pixels, 
can be used to illustrate similarity search. The typical query is to find all 
the similar fingerprints to a new sample from the street. This similarity is 
computed using some function of the pixels forming the image of the query and 
any candidate from the database. This is an expensive operation that we would 
like to minimize. In classical search we would have liked to assign a key to each
set of fingerprints in the world. This way we could ask for "fingerprint
set 13747".

The major problem, however, is precisely that no two sets are
exactly the same, even if taken from the same person. The first set will be 13747
while the second could be 78965, thus making this numbering scheme of little 
use - unless we have some way of ensuring that the second set of fingerprints 
will be always assigned a number "close" to 13747. It is realistic to assume 
that police investigators are interested in the "closest 50 matches" which they 
then would like to review by hand. The database designer then has to provide 
this capability with just one piece of structural information: the function that 
measures the similarity between two sets of fingerprints. 

This problem therefore lends itself well to characterization in terms of metric
spaces (alternative characterizations, without metric spaces, exist [Goy08]).

1.1 The search problem 

Formally, the problem is framed in terms of a metric space: 

Definition 1.1. A metric space is a set $\Omega$ equipped with a function $\rho : \Omega \times \Omega \to \mathbb{R}$
s.t.

• $\rho(\omega_1, \omega_2) \geq 0$, with equality if and only if $\omega_1 = \omega_2$

• $\rho(\omega_1, \omega_2) = \rho(\omega_2, \omega_1)$ and

• $\rho(\omega_1, \omega_2) \leq \rho(\omega_1, \omega_3) + \rho(\omega_3, \omega_2)$

The function $\rho$ is a distance function (also: metric) and it is the main
mathematical structure of $\Omega$. Its most important feature perhaps is the last
property, known as the triangle inequality. Essentially it is the only tool available
for inferring distances in a metric space.

In addition the metric space is equipped with a probability measure, the
definition of which we assume is known to the reader or can be looked up in a
text like [Tay06]. It is a way of assigning a weight to "measurable" subsets in
the space $\Omega$ to account for different likelihoods of a random point falling into
them. At least in the case of a finite set it is simple enough:

Definition 1.2. For a finite set $X$ a probability measure on the set is a function
$\mu : 2^X \to [0, 1]$ s.t. $\sum_{x \in X} \mu(\{x\}) = 1$, so that $\forall A \subseteq X$, $\mu(A) = \sum_{x \in A} \mu(\{x\})$.

What we call a dataset is a finite subset $X \subset \Omega$ equipped with the inherited
metric $\rho|_X$ and normalized counting measure, which assigns to each $A \subseteq X$
the weight $|A|/|X|$ (and which is indeed a probability measure).

There is good reason to distinguish between the space $\Omega$ and the dataset
$X$ we have on hand. In the fingerprints database example, we cannot claim to
have all known and future fingerprints already in the database. More likely is
the opposite: every time a set of fingerprints comes in for inspection, no exact
matching set exists on file. So the new set must have come from outside $X$, yet
$X$ is the space we are trying to search.

What can $\Omega$ be in this case? All the possible fingerprint impressions, taken
under all possible various conditions, of the entire human population? Past,
present, future? The exact specification of $\Omega$ may be hard to formalize and
is ultimately unimportant. Perhaps a parallel can be drawn with probability
theory where $\Omega$ stands for sample space and $X$ for a random variable.

Furthermore, the measure $\mu$ is as a consequence also hard to specify, but it
seems reasonable to assume that in any case, some fingerprints are more likely
than others to show up as queries to our database. Then all we can safely
assume about $\mu$ is that it is some non-uniform measure, which is unknown to
us.

$X$ on the other hand is quite concrete - we literally have a list of all the
elements at our disposal. In addition to being a subset of $\Omega$ it is a metric
subspace in the sense that the same metric $\rho$ is used to calculate distances
between points of $X$. To simply inherit $\mu$ is problematic as we don't actually
know what it is. Nevertheless we would like to do computations in $X$ taking
probabilities into account. The solution is to use the counting measure, also
known as the empirical measure. This may seem crude but it is actually a
pretty good approximation of the "real" measure as a consequence of a well-known
theorem in statistics (more on this later).
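
As a rough illustration of the empirical measure standing in for an unknown "real" one, the following sketch draws a dataset from a distribution we pretend not to know and estimates the weight of a set by counting. The choice of the uniform distribution on [0, 1], the set A = [0, 0.3] and the function name are all assumptions made purely for the example.

```python
import random


def empirical_measure(dataset, belongs_to_set):
    """Normalized counting measure of {x in dataset : belongs_to_set(x)}."""
    return sum(1 for x in dataset if belongs_to_set(x)) / len(dataset)


random.seed(0)
# Hypothetical "real" measure: uniform on [0, 1], so A = [0, 0.3] has weight 0.3.
dataset = [random.random() for _ in range(10_000)]
estimate = empirical_measure(dataset, lambda x: x <= 0.3)
print(estimate)  # should be close to 0.3 for a sample of this size
```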

Given this structure we can perform the following similarity queries.
Given the key $q \in \Omega$,

• Nearest neighbour: find the $k$ closest elements in $X$ to $q$

• Range: find all the elements in $X$ within distance $r$ from $q$

• Proportion (variation on the first): return the closest fraction $k$ of $X$ to $q$
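
The three query types can be sketched as naive scans over the dataset. This is only a baseline illustration, not an index: the function names are hypothetical, and the toy dataset on the real line with the absolute difference as metric is an assumption chosen to keep the example short.

```python
import math


def knn_query(X, dist, q, k):
    """Nearest neighbour query: the k elements of X closest to q."""
    return sorted(X, key=lambda x: dist(q, x))[:k]


def range_query(X, dist, q, r):
    """Range query: all elements of X within distance r of q."""
    return [x for x in X if dist(q, x) <= r]


def proportion_query(X, dist, q, fraction):
    """Proportion query: the closest given fraction of X, rounded up."""
    return knn_query(X, dist, q, math.ceil(fraction * len(X)))


# Toy dataset on the real line with the usual distance |x - y|.
X = [1, 4, 6, 9, 15]
dist = lambda a, b: abs(a - b)
print(knn_query(X, dist, 5, 2))           # [4, 6]
print(range_query(X, dist, 5, 4))         # [1, 4, 6, 9]
print(proportion_query(X, dist, 5, 0.5))  # [4, 6, 1]
```

Each of these computes all n distances, which is exactly the cost an index is meant to avoid.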

An example above was of a 50 nearest neighbour query. Supposing that such
a similarity query on a fingerprint database would rely on counting common
"features", a range query can be "find all records that have at least 10 features
identical to this sample q". What these two queries have in common is the
underlying idea that they will return a small set of results that a human can
then manually examine. Effectively the different kinds of similarity queries are
closely related, and studying one sort is not a serious limitation (see [Cha01]).

More specialized queries, e.g. finding all pairs of nearest neighbours, are 
approached differently both in the building and analysis of algorithms and so 
will not be covered. 

1.2 Indexing for search 

To answer a similarity query we can revert to the strategy of looking up every
element $x \in X$ and calculating $\rho(q, x)$. We will call this a linear scan, as it
is clearly linear in the size of the dataset. However we will suppose (what is
often the case) that the calculation of distances $\rho(q, x)$ is the most expensive
operation [Cha01].

In such a setting the linear scan approach is very slow: if a single distance
computation takes 1/100th of a second, $10^6$ seconds (more than eleven days) would
be needed to traverse a database with 100 million records. Distance functions that take
milliseconds and more to execute are mentioned, for example, in [Cia98].

An index is an added structure to the database that facilitates operations 
like searching. Here we will restrict the use of the word to the scope of this 
work: 

Definition 1.3. An index is a data structure whose aim is to speed up the 
execution of similarity queries on a particular dataset, typically consisting of 
some pre-calculated values and an algorithm. 

Let's quickly introduce a simple index, Orchard's algorithm [Cla05], as an
example:

Example 1.1. For the $n$ points in $X$, we create an $n \times n$ matrix of distances
$\rho(x, y)$: each row corresponding to all the distances to $x$, sorted in increasing
order. To perform a 1-nearest neighbour search query with centre $q$, we pick a
random element $x$ and go through the row of distances corresponding to $x$. If we
find $y$ s.t. $\rho(y, q) < \rho(x, q)$ we switch to the row of $y$.

We avoid going through all the elements of $X$ by applying the following
criterion: we stop going through the list of $y$ if the last seen element $z$ satisfies
$\rho(z, y) > 2\rho(y, q)$. This follows from

$\rho(z, y) \leq \rho(z, q) + \rho(q, y)$
$\rho(z, q) \geq \rho(z, y) - \rho(y, q)$
$\Rightarrow \rho(z, q) > \frac{1}{2}\rho(z, y) > \rho(y, q)$

the assumption having been applied twice in the last part. Since the row is sorted,
every element after $z$ is at least as far from $y$ and hence also farther from $q$ than
$y$ is, so $y$ is returned as the answer; the same happens if the row of $y$ is exhausted
without finding a closer element. We can further improve
the search by applying various strategies to avoid unnecessary lookups.
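
A minimal sketch of the procedure of Example 1.1 follows. It is an illustration under simplifying assumptions rather than a faithful reproduction of the algorithm in [Cla05]: the function names are hypothetical, the index stores sorted (distance, position) pairs, the starting element is fixed instead of random, and no attempt is made to avoid recomputing distances to elements seen earlier.

```python
import math
import random


def build_orchard_index(points, dist):
    """For each point, the other points sorted by increasing distance to it."""
    n = len(points)
    return [
        sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        for i in range(n)
    ]


def orchard_nn(points, index, dist, q, start=0):
    """1-nearest neighbour of q; returns (position, distance, computations)."""
    current = start
    d_current = dist(points[current], q)
    computations = 1
    improved = True
    while improved:
        improved = False
        for d_to_current, z in index[current]:
            if d_to_current > 2 * d_current:
                break  # stopping criterion: all later z are farther from q
            d_z = dist(points[z], q)
            computations += 1
            if d_z < d_current:
                # z is closer to the query: switch to the row of z
                current, d_current = z, d_z
                improved = True
                break
    return current, d_current, computations


random.seed(3)
points = [(random.random(), random.random()) for _ in range(200)]
index = build_orchard_index(points, math.dist)
print(orchard_nn(points, index, math.dist, (0.5, 0.5)))
```

The index occupies on the order of $n^2$ stored distances, which is the price paid for the reduction in distance computations at query time.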

Practically speaking the reason to have an index is that the days or weeks
taken by a naive approach can be reduced to a few hours, maybe even only a
few minutes. The consequences cannot be overstated, for, amazing as
our ability to amass data is, the "killer app" is search: a collection of 100 million
sets of fingerprints is interesting, but a system capable of handling thousands
of queries per day is downright useful.

Dozens of ideas of how to build such an index have been presented and new
ones, having various advantages and analyzed in different ways, are invented
continuously. Attempts at categorizing and providing a unified framework are
recent: a book on the subject appeared in 2005 [Zez05] and the first international
conference on similarity search [SISAP] was held in 2008.

Several high-level views exist as to how to describe mathematically the function
of an index in performing similarity search.

One is outlined in [Cha01]: an indexing scheme works by partitioning the
space $\Omega$ into regions. In other words the space is decomposed as $\Omega = \bigcup_i R_i$
where the $R_i$ are finitely many, pairwise disjoint subsets of the space. A query
algorithm takes advantage of this partitioning as follows. Through the use of
some inequalities - the triangle inequality in some form - we discard some of
the regions as it can be proved that no elements of $X$ lying in those regions can
possibly be in the query results. The elements of $X$ in the undiscarded regions
have their distances $\rho(q, x)$ computed through a linear scan to determine if
they should be returned. As mentioned above a linear scan is nothing but a
sequential lookup of each of the region's elements, so the key is to eliminate
as many regions as possible. A high level pseudocode description is given in
Algorithm 1.

Data: query, index
Result: queryResults

for each region in index.Regions do
    if (NOT index.IsExcluded(region, query)) then
        append LinearSearch(query, region) to queryResults;
    end
end
return queryResults;

Algorithm 1: Use of an index for query execution: regions
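
Algorithm 1 can be made concrete with a small sketch. The choices below are assumptions made for illustration and are not prescribed by [Cha01]: regions are obtained by assigning each point to its nearest centre, each region remembers its covering radius, the query is a range query, and a region is excluded when the triangle inequality shows it cannot intersect the query ball. All names are hypothetical.

```python
import math
import random


def build_regions(points, centres, dist):
    """Partition points by nearest centre and record each region's radius."""
    regions = [{"centre": c, "radius": 0.0, "members": []} for c in centres]
    for x in points:
        region = min(regions, key=lambda reg: dist(x, reg["centre"]))
        region["members"].append(x)
        region["radius"] = max(region["radius"], dist(x, region["centre"]))
    return regions


def is_excluded(region, q, r, dist):
    """True if no member can be within r of q, since for every member x
    rho(q, x) >= rho(q, centre) - rho(centre, x) >= rho(q, centre) - radius."""
    return dist(q, region["centre"]) - region["radius"] > r


def region_range_query(regions, q, r, dist):
    """Algorithm 1 for a range query: scan only the regions not excluded."""
    results = []
    for region in regions:
        if not is_excluded(region, q, r, dist):
            results.extend(x for x in region["members"] if dist(q, x) <= r)
    return results


random.seed(5)
points = [(random.random(), random.random()) for _ in range(300)]
regions = build_regions(points, points[:10], math.dist)
print(len(region_range_query(regions, (0.5, 0.5), 0.1, math.dist)))
```

The saving comes entirely from the excluded regions: one distance computation to the centre replaces a scan of all the region's members.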

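The triangle inequality and some variations thereof, as used in indexing to
avoid sequential scan, are listed in [Zez05]:

• Double sided triangle inequality:

$|\rho(\omega_1, \omega_3) - \rho(\omega_2, \omega_3)| \leq \rho(\omega_1, \omega_2) \leq \rho(\omega_1, \omega_3) + \rho(\omega_2, \omega_3)$

in other words, knowing distances to a third point $\omega_3$ gives us constraints
on $\rho(\omega_1, \omega_2)$

• We only know that $r_l \leq \rho(\omega_1, \omega_3) \leq r_h$ while $\rho(\omega_2, \omega_3)$ is still known
exactly. Then:

$\max\{\rho(\omega_2, \omega_3) - r_h,\ r_l - \rho(\omega_2, \omega_3),\ 0\} \leq \rho(\omega_1, \omega_2) \leq \rho(\omega_2, \omega_3) + r_h$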
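
The two sets of bounds can be written as small helper functions and checked numerically. This is a sketch only: the function names are hypothetical, and the points are drawn at random in the Euclidean plane purely as an assumed test space.

```python
import math
import random


def double_sided_bounds(d13, d23):
    """Bounds on rho(w1, w2) given the exact distances of both points to w3."""
    return abs(d13 - d23), d13 + d23


def range_pivot_bounds(r_low, r_high, d23):
    """Bounds on rho(w1, w2) when only r_low <= rho(w1, w3) <= r_high is known."""
    lower = max(d23 - r_high, r_low - d23, 0.0)
    upper = d23 + r_high
    return lower, upper


random.seed(1)
w1, w2, w3 = [(random.random(), random.random()) for _ in range(3)]
d12, d13, d23 = math.dist(w1, w2), math.dist(w1, w3), math.dist(w2, w3)
print(double_sided_bounds(d13, d23), d12)
print(range_pivot_bounds(d13 - 0.05, d13 + 0.05, d23), d12)
```

The second pair of bounds is never tighter than the first, which reflects the loss of information when only an interval for the distance to the third point is stored.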



Given these simple tools the diversity of the various indexing schemes is astonishing.
There are tree and flat structures, structures that try to optimize
inserts and deletes, trees that are deep or shallow, balanced or not, with claimed
computational complexities from constant to exponential in $n$, the size of $X$. Furthermore,
because search is a topic of research in multiple disciplines, including
for example pattern recognition [Dev97] where a "nearest neighbour" algorithm
is almost equivalent to nearest neighbour search, the same solutions have been
reinvented multiple times and called different names [Cha01], [Cla05].

Another reason for this multiplicity of algorithms has to do with the fact that 
no "best" algorithm has been found yet and so a number of solutions offering 
specific space-time tradeoffs or other features have been developed. 

It is important to note that in some situations indexing is not possible. 
Some metric spaces are so general that the distance function does not provide 
any usable information. We will first consider a trivial example, that of the 0-1 
distance [Cla05]. 

Example 1.2. Suppose we have $(\Omega, \rho, \mu)$ such that

$\rho(\omega_1, \omega_2) = 0$ if $\omega_1 = \omega_2$, and $\rho(\omega_1, \omega_2) = 1$ if $\omega_1 \neq \omega_2$.

Then, all query types reduce to finding an exact match. Furthermore no amount
of storage of the distances between various elements of $X$ can facilitate
a search query as no information can be added. So no metric based index works.

Of course this particular example may not seem very convincing as we already
mentioned the existence of efficient solutions for exact match. Even in
the absence of keys, we could use a hash function [Sed98] to reduce the problem
to a quick binary search.

Another, more substantive example is given by [Lif07].

Example 1.3. In this space, based on a graph generated from web mining [Blo00],
the distance function is such that for any two distinct elements $\omega_1, \omega_2$:

$1/2 \leq \rho(\omega_1, \omega_2) \leq 1$

Then in this case inferring $\rho(\omega_1, \omega_2)$ based on knowledge of distances to a third
point is not informative. We can verify that all the commonly used variations
of the triangle inequality listed above reduce to something we already know,
namely that distances lie between 1/2 and 1.

Spaces that are impossible or at least very hard to index are by no means
rare - their high incidence is a whole subject of study.
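
The claim of Example 1.3 can be checked directly against the double sided triangle inequality. In the short derivation below, $a$ and $b$ are names introduced only for this illustration and denote the two known distances to the third point:

```latex
% a = \rho(\omega_1,\omega_3) and b = \rho(\omega_2,\omega_3), both in [1/2, 1]
\begin{align*}
|a - b| &\le 1 - \tfrac12 = \tfrac12
  && \text{(the inferred lower bound is at most } \tfrac12\text{)} \\
a + b &\ge \tfrac12 + \tfrac12 = 1
  && \text{(the inferred upper bound is at least } 1\text{)} \\
\text{hence}\quad |a - b| \le \tfrac12 &\le \rho(\omega_1,\omega_2) \le 1 \le a + b .
\end{align*}
```

Both inferred bounds are therefore never sharper than the range of distances already known a priori.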
1.3 The curse of dimensionality

An often repeated observation is the inability of algorithms to deal with high
dimensional datasets (e.g. [Bey99]), a phenomenon described as the curse of
dimensionality. Simply put, when algorithms are run on Euclidean datasets of
increasing dimension, performance drops exponentially as a function of dimension.

The concept of dimension in a general metric space is less precise. Clearly
it has to obey our intuition in Euclidean space so for example a plane in the
10-dimensional space $\mathbb{R}^{10}$ is still 2-dimensional, and it would be desirable for a
uniformly distributed ball in $\mathbb{R}^d$ to be $d$-dimensional. But what to do about
datasets like sets of fingerprints, or movies, where an intuitive notion of dimension
is harder to develop?

One approach is to focus on the metric space properties of $\Omega$, and take
advantage of already known concepts for metric spaces, such as packing numbers
and $\varepsilon$-nets.

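Definition 1.4. An $\varepsilon$-net $C$ of $(\Omega, \rho)$ is a set of points from $\Omega$ satisfying

$\bigcup_{c \in C} B(c, \varepsilon) = \Omega$
where $B(c, \varepsilon) = \{\omega \in \Omega : \rho(\omega, c) \leq \varepsilon\}$

The size of the minimal $\varepsilon$-net is called the covering number of $(\Omega, \rho)$ and
denoted $C(\Omega, \varepsilon)$ as it is a function of $\varepsilon$.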
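
For a finite sample, an $\varepsilon$-net can be built greedily, as in the sketch below. The greedy net is generally not minimal, so its size is only an upper bound on the covering number of the sample (with centres restricted to sample points); the function name and the uniform sample in the unit square are assumptions made for the example.

```python
import math
import random


def greedy_epsilon_net(points, dist, eps):
    """Pick as a new centre every point not yet within eps of an earlier centre."""
    centres = []
    for x in points:
        if all(dist(x, c) > eps for c in centres):
            centres.append(x)
    return centres


random.seed(2)
sample = [(random.random(), random.random()) for _ in range(500)]
for eps in (0.4, 0.2, 0.1):
    net = greedy_epsilon_net(sample, math.dist, eps)
    print(eps, len(net))  # net sizes for decreasing eps
```

By construction every sample point lies within $\varepsilon$ of some centre, and any two centres are more than $\varepsilon$ apart.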

It is not hard to convince oneself that at least in Euclidean spaces higher
dimension leads to higher covering numbers. For the unit cube in $\mathbb{R}^d$ the covering
number is on the order of $1/\varepsilon^d$ [Cla05]. This concept is extended in [Cla05]
to define the Assouad dimension (see algorithmic complexity notation later in
Table 1.3):

Definition 1.5. The Assouad dimension of $(\Omega, \rho)$ is the number $d$ satisfying:

$\sup_{\omega \in \Omega,\ r > 0} C(B(\omega, r), \varepsilon r) = 1/\varepsilon^{d + o(1)}$

A small Assouad dimension is a very strong requirement: all the balls in
the space have to be well behaved in the sense that they admit small covers.
Perhaps unsurprisingly results exist that show feasibility of indexing for similarity
search in case of small Assouad dimension [Cla05]. However the curse of
dimensionality stands: these algorithms have an exponential dependence on the
Assouad dimension. This concept is also too complicated to compute for real
datasets where $\Omega$ is unknown: it is not clear how to estimate it from $X$ alone.

As the number of proposed concepts of "intrinsic dimension" for the purposes
of similarity search is growing, [Pes07] outlines desirable properties we should
look for. Included are ease of computation and definition for discrete spaces in
such a way that the intrinsic dimension of $X$ is closely related to the intrinsic
dimension of $\Omega$.

An easy to compute and reasonable proposal for a dimension is mentioned
in [Cha01b], namely the quantity

$\frac{E(\rho(X, Y))^2}{2\,\mathrm{Var}(\rho(X, Y))}$

where $X, Y \sim \mu$, the distribution of points in $\Omega$. Using relatively small samples,
the empirical intrinsic dimension as computed for various objects in Table 1.2
gives reasonable answers.


Space                               calculated dimension
20-dimensional uniform unit cube    27.6
the NASA dataset [DIMACS]           3.9
20-dimensional uniform sphere       20.8
100-dimensional uniform sphere      139.5

Table 1.2: Some empirical approximations of Chavez intrinsic dimension



This concept is based on looking at the histogram of distances

$\{\rho(q, x) \mid x \in X\}$

Given a query centre $q$, if the histogram of distances from $q$ to points in $X$
shows a lot of "concentration", this will be a hard query to process as most
points will need to be checked. By concentrated we mean of low variance, while
the mean of the distances is in the numerator to account for different scales.
Explained another way, we would like a uniformly distributed unit cube in $\mathbb{R}^d$
to have the same dimension as a uniformly distributed cube twice the size in the
same space, so there is a need to normalize. This will often be a non-issue as
we would normalize the distance function so that $E(\rho(x, y)) = 1$.

The intrinsic dimension of [Cha01b] then is an average measure of all the possible
histograms taken from "viewpoints" $q$. In reality both $\mu$ and $\Omega$ are either
unknown or unworkable so this measure is to be estimated from $X$. Underlying
is the assumption that datasets exhibit a certain amount of homogeneity of
viewpoints, that is histograms taken from different $q$ will still look similar - a
hypothesis with some experimental validation [Cia98].
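
Estimating this quantity from a dataset can be sketched as follows: sample random pairs, compute their distances, and take the squared mean over twice the variance. The function name, the number of sampled pairs and the synthetic 20-dimensional uniform cube are assumptions made for illustration; this is not the exact procedure used to produce Table 1.2.

```python
import math
import random
import statistics


def intrinsic_dimension(sample, dist, pairs=2000, rng=random):
    """Squared mean of pairwise distances divided by twice their variance."""
    distances = []
    for _ in range(pairs):
        x, y = rng.sample(sample, 2)   # two distinct elements of the sample
        distances.append(dist(x, y))
    mean = statistics.fmean(distances)
    return mean ** 2 / (2 * statistics.pvariance(distances))


random.seed(0)
# Synthetic stand-in for a dataset: uniform points in the 20-dimensional unit cube.
cube = [[random.random() for _ in range(20)] for _ in range(1000)]
print(intrinsic_dimension(cube, math.dist))
```

Scaling all distances by a constant leaves the estimate unchanged, which is the normalization property discussed above.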

For "truly" $d$-dimensional structures in Euclidean space, e.g. uniformly distributed
unit cube, this intrinsic dimension corresponds to $d$ (asymptotically).

Using this intrinsic dimension it is possible to derive a lower bound on the 
number of distance computations required as a function of d [Cha01b]. This
bound however is not very strong - on the order of d ln(n). Thus only for large 
d relative to n is this lower bound significant. This leads to our next topic: how 
big is dimension allowed to get, in relation to n? 

Henceforth we will use $d$ somewhat ambiguously, referring to either the usual
notion from vector space theory or one of the intrinsic dimension concepts with
the understanding that they all coincide, at least asymptotically.

As mentioned in [Cha01], for a fixed dimension $d$ and fixed $n$, we can find
indexing schemes that are fast. The problem that we would like to analyze has
to do with scaling these algorithms as both $d$ and $n$ grow. This requires us to
make precise how fast we let these two quantities grow in relation to each other.
This invariably leads to the use of algorithmic complexity notation, a summary
of which appears in Table 1.3.



notation                  definition
$f(t) = O(g(t))$          for some $C > 0$, eventually $f(t) < C g(t)$
$f(t) = o(g(t))$          for any $C > 0$, eventually $f(t) < C g(t)$
$f(t) = \Theta(g(t))$     for some $C_1, C_2 > 0$, eventually $C_1 g(t) < f(t) < C_2 g(t)$
$f(t) = \omega(g(t))$     for any $C > 0$, eventually $f(t) > C g(t)$

Table 1.3: Algorithmic complexity notation



An asymptotic analysis will therefore involve both:

$d \to \infty$

and

$n \to \infty$

We will try to argue for what the relation between d and n should be by going 
back to the more fundamental question of what is an efficient index. It is clear 
that we should be able to perform similarity queries with less time than that 
taken by a linear scan. In the language of algorithmic complexity, we require a 
sublinear complexity in n, that is 

querytime = o(n) 

where by querytime we mean the average time it takes for a similarity query to 
execute, time measured in distance computations. The average here is computed 
over a reasonable space of possible queries, on which we will touch later. 

Storage is also important, with at most polynomial storage allowed (although in
practice even this may be too much):

$$\text{storage} = n^{O(1)}$$

For our particular indexing scheme the storage is measured by the number of
distances stored. We do not distinguish among the different ways to store a
real number: in all cases it counts as 1 unit (of cost). This covers a large
number of indexing schemes that are essentially arrays of pre-computed
distances.

As our main concern is with asymptotic analysis, it is also necessary to specify
bounds on d. We will follow the approach of the authoritative survey [Ind04] and
focus on a particular range for d: superlogarithmic but subpolynomial in n.
Expressed using the notation above,

$$d = \omega(\log n) \qquad (1.1)$$
$$d = n^{o(1)} \qquad (1.2)$$

The lower bound (1.1) is motivated by a case study which requires the
definition of Hamming cubes.

Definition 1.6 (The Hamming Cubes $\Sigma^d$). The Hamming cube of dimension
d is defined as the set of all binary sequences of length d, that is, its
elements are of the form

$$x = (0,1,1,0,1,\ldots,1)$$

and the distance between two strings is the number of positions in which they
differ, divided by d:

$$\rho(x,y) = \frac{|\{i : x_i \neq y_i\}|}{d}$$

This metric is known as the normalized Hamming distance. We will give the
cube a uniform measure for this discussion.
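As a small illustration of Definition 1.6, the normalized Hamming distance can
be computed as in the following sketch. The function name
`normalized_hamming` and the example strings are our own choices for
illustration and do not come from the cited literature.

```python
from typing import Sequence


def normalized_hamming(x: Sequence[int], y: Sequence[int]) -> float:
    """Fraction of coordinates in which the binary strings x and y differ."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    differing = sum(1 for a, b in zip(x, y) if a != b)
    return differing / len(x)


# Two hypothetical elements of the Hamming cube of dimension 6.
x = (0, 1, 1, 0, 1, 1)
y = (1, 1, 0, 0, 1, 0)
print(normalized_hamming(x, y))  # the strings differ in 3 of 6 positions
```

Because of the division by d, the diameter of the cube stays equal to 1 in
every dimension.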

It turns out that, at least in this case, when d grows slowly, say
$d = O(\log n)$, the entire space is so small relative to the size of the
dataset that all possible queries can be pre-computed and stored without
breaking the polynomial storage requirement. The size of $\Omega = \Sigma^d$ is
just $2^d$, which is at most on the order of n when $d \le \log_2 n$. As there
are on the order of n possible radii, there are only about $n^2$ possible
queries, all of which can be precomputed. So to build a general framework for
asymptotic bounds it seems necessary that d grow strictly faster than log n.

As we consider algorithms that are exponential in d to suffer from the curse
of dimensionality, we will require the query time to be polynomial in d
([Ind04]):

$$\text{querytime} = d^{O(1)}$$

The upper bound (1.2) on d results from the observation that if d grew so fast
that $n = d^{O(1)}$, a sequential scan would be polynomial in d. As nothing
needs to be proven in that case, we focus on the case where d is subpolynomial
in n and require an algorithm polynomial in d and hence subpolynomial in n.

We will adopt the view that these bounds on d are a reasonable setting for 
the investigation of performance of various index based query algorithms. While 
d grows fast enough to not render the problem trivial, we disregard high rates 
of growth for which proven examples of the "curse" already exist. 

Summarizing: 

The goal of finding a scalable index is to find an algorithm that uses storage
polynomial in n (preferably of degree less than 2) and allows search in time
polynomial in d.

This stands in contrast to the curse of dimensionality conjecture, whose 
form we borrow from [Ind04]: 

If $d = \omega(\log n)$ and $d = n^{o(1)}$, any sequence of indexes built on a
sequence of datasets $X_d \subset \Omega_d$ allowing exact nearest neighbour
search in time polynomial in d must use $n^{\omega(1)}$ space.

At the moment of writing, a proof of the above has not been found. We provide
it here for pivot-based indexes.

1.4 Pivot-based indexing



Pivot-based indexing algorithms (for example AESA, MVPT, BKT, ...; see [Cha01]
and [Zez05]) rely on a selection of elements from X that are used as proxies for
the rest of the dataset. That is, distances from all elements of X to the pivot
elements are computed and then used to cut down computations through the
triangle inequality.

Given a pivot set $\{p_1, \ldots, p_k\}$, we compute the $n \times k$ array of
distances

$$\rho(x,p_i), \quad 1 \le i \le k, \; x \in X$$

This array serves as the index.

Given a range query with radius r and centre q, the k distances
$\rho(q,p_1), \ldots, \rho(q,p_k)$ are computed so that $\rho(q,x)$ can be
lower-bounded as follows:

$$\rho(q,x) \ge |\rho(q,p_i) - \rho(x,p_i)|$$

Since this holds for any i, we can establish:

$$\rho(q,x) \ge \sup_{1 \le i \le k} |\rho(q,p_i) - \rho(x,p_i)|$$

It is useful to think of a new distance function, based on the k pivots:

$$\rho_k(q,x) := \sup_{1 \le i \le k} |\rho(q,p_i) - \rho(x,p_i)|$$

The fact that $\rho(q,x) \ge \rho_k(q,x)$ can be used as a condition to discard
all x satisfying:

$$\rho_k(q,x) > r$$

Therefore the algorithm consists of checking this condition and, if it is not
satisfied, performing the (expensive) distance calculation to verify whether

$$\rho(q,x) > r$$

Only if this is again not true do we know that the point should be returned in
the query. This process is described in Algorithm 2: we call the new distance
function $\rho_k$ index.distanceK to emphasize that it is a function belonging
to the index.

We will focus on range queries with pivot-based algorithms chiefly because
they are easier to execute. At least in theory k-nearest neighbour queries can
always be simulated by a range query with the radius set to the distance to the
kth neighbour [Zez05].

As the iteration in Algorithm 2 runs over all the points of the dataset, it
may appear that this algorithm does not fit the framework of "regions". But it
can always be considered a degenerate case where the regions consist of
singletons of points in the dataset plus the rest:

$$\Omega = \Big(\bigcup_{x \in X} \{x\}\Big) \cup (\Omega \setminus X)$$

Data: query, index
Result: queryResults
for each point in dataSet do
    if index.distanceK(point, query.center) <= query.radius then
        if distance(point, query.center) <= query.radius then
            append point to queryResults
        end
    end
end
return queryResults;

Algorithm 2: Querying a pivot-based index
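Algorithm 2 can be sketched in Python as follows. This is only a minimal
illustration under our own assumptions: the class names `CountingMetric` and
`PivotIndex`, the grid dataset and the two pivots are hypothetical and chosen
for the example, and the metric is counted explicitly so that the cost model
of the text (one unit per distance computation) can be observed.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


class CountingMetric:
    """Wraps a metric and counts how many times it is evaluated."""

    def __init__(self, metric: Callable[[Point, Point], float]) -> None:
        self.metric = metric
        self.calls = 0

    def __call__(self, a: Point, b: Point) -> float:
        self.calls += 1
        return self.metric(a, b)


class PivotIndex:
    """Stores the n x k array of distances rho(x, p_i)."""

    def __init__(self, data: Sequence[Point], pivots: Sequence[Point],
                 rho: Callable[[Point, Point], float]) -> None:
        self.data = list(data)
        self.pivots = list(pivots)
        self.rho = rho
        self.table = [[rho(x, p) for p in self.pivots] for x in self.data]

    def range_query(self, q: Point, r: float) -> List[Point]:
        # k distance computations: rho(q, p_1), ..., rho(q, p_k)
        q_to_pivots = [self.rho(q, p) for p in self.pivots]
        results = []
        for x, row in zip(self.data, self.table):
            # rho_k(q, x) uses stored values only, so it is free in the cost model
            rho_k = max((abs(dq - dx) for dq, dx in zip(q_to_pivots, row)),
                        default=0.0)
            if rho_k > r:
                continue  # x is discarded without computing rho(q, x)
            if self.rho(q, x) <= r:  # the expensive verification
                results.append(x)
        return results


# Hypothetical example: a 10 x 10 grid in the plane with two pivots.
data = [(float(i), float(j)) for i in range(10) for j in range(10)]
rho = CountingMetric(math.dist)
index = PivotIndex(data, pivots=[(0.0, 0.0), (9.0, 0.0)], rho=rho)
rho.calls = 0  # do not charge the construction of the index to the query
hits = index.range_query((4.2, 4.2), 1.0)
print(sorted(hits), rho.calls)
```

In this low dimensional example the number of distance computations per query
is far below the size of the dataset, which is the behaviour one hopes for from
an index.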



This differs, at least on a theoretical level, from the decomposition presented
in [Cha01], where an equivalence relation on the points of $\Omega$ is proposed:

$$\omega_1 \sim \omega_2 \iff \forall\, 1 \le i \le k,\; \rho(\omega_1,p_i) = \rho(\omega_2,p_i)$$

This equivalence relation is then made to induce a partition of the space. In
Euclidean space the parts are intersections of spheres, which even for small
k will be single points.

Perhaps a more useful characterization, also presented in [Cha01], is to think
of pivot-based indexing as sending $\Omega$ to a different space and then
performing a range similarity search in the new space:

$$\Omega \to (\ell^\infty(k), \ell^\infty\text{-norm}) : \omega \mapsto (\rho(\omega,p_1), \ldots, \rho(\omega,p_k))$$

The new space consists of sequences of reals of length k, with the max-distance,
also known as the $\ell^\infty$-norm. This is akin to our musing at the
beginning of the chapter, where we admitted that having a function that sends
every set of fingerprints to a number could be useful if the function had
properties that allowed us to avoid a sequential scan of the original space.

As our cost model only counts distance computations in the original space,
a range search in $\ell^\infty(k)$ is considered free. That is, our results
stand even under the generous assumption that it takes no time to perform a
search in $\ell^\infty(k)$.

We will denote by C the set of all points of X satisfying

$$\rho_k(q,x) > r$$

that is, all the discarded elements. Making C large is the primary way of
cutting the cost of search when distance computations are the dominating cost.
Of course we can achieve this trivially with a very large number of pivots.
This will defeat the purpose, however, as

$$\text{Cost of range search} = k + |X \setminus C|$$

The most often used solution is to keep adding pivots as long as this is found
experimentally to decrease the cost of search. If k is small, on the order of
log n (as space limitations often require), the most important component of
the cost becomes $|X \setminus C|$, and this is where the choice of pivots
would seem to matter. Various approaches to pivot selection have been
investigated in [Bus03]. The empirical results seem to suggest that a moderate
reduction in the number of distance computations can be achieved, although the
relative improvement drops with increasing dimension.

There are numerous refinements on this basic approach to pivot-based indexes,
but the underlying idea of using the triangle inequality together with the
pre-computed distances is the same. Moreover, [Cha01] argues that pivot-based
indexes are one of only two types of metric space indexing algorithms, the
other type being closely related as well. Therefore investigating this
barebones pivot index can be thought of as representative of a large number of
actual implementations with the unnecessary complications removed.

To recap, we are hoping that a judicious choice of the pivots (in particular
their number k) will result in an average C that is big, preferably on the order
of 99% of n (the size of X). Better yet would be to guarantee that
$X \setminus C$ contains no more than some fixed number of points, say 1000,
irrespective of the size of n. This way only the remaining elements will have
to be totally searched, which will produce an efficient algorithm as long as we
keep k reasonably small.

In situations involving the concentration of measure phenomenon this sce- 
nario cannot happen. In fact we will show that the exact opposite takes place. 
The set C will almost certainly be small, and most of the dataset will have to 
be exhaustively searched. 
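The effect can be observed in a small experiment. The sketch below is only an
illustration under assumptions of our own: the dataset is uniform in the unit
hypercube with the (non-normalized) Euclidean distance, pivots are picked at
random from the dataset, the radius of each query is the distance to its
nearest neighbour, and the parameters `n`, `k` and `queries` are arbitrary
small values. It is not the experimental setup of any of the cited works.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def discarded_fraction(d: int, n: int = 300, k: int = 5,
                       queries: int = 20, seed: int = 0) -> float:
    """Average |C| / n over nearest-neighbour-radius queries in [0, 1]^d."""
    rng = random.Random(seed)
    data: List[Point] = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
    pivots = rng.sample(data, k)
    # the index: distances from every data point to every pivot
    table = [[math.dist(x, p) for p in pivots] for x in data]
    total = 0.0
    for _ in range(queries):
        q = tuple(rng.random() for _ in range(d))
        r = min(math.dist(q, x) for x in data)  # nearest-neighbour distance
        q_to_pivots = [math.dist(q, p) for p in pivots]
        discarded = sum(
            1 for row in table
            if max(abs(dq - dx) for dq, dx in zip(q_to_pivots, row)) > r
        )
        total += discarded / n
    return total / queries


low = discarded_fraction(d=2)
high = discarded_fraction(d=100)
print(f"d=2:   fraction discarded = {low:.2f}")
print(f"d=100: fraction discarded = {high:.2f}")
```

In the plane almost every point is discarded by the pivots, while in dimension
100 almost none is, so nearly the whole dataset has to be compared with the
query directly.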

1.5 Approximate Search 

A problem related to similarity search is approximate similarity search [Zez05].
The approximate version of, say, nearest neighbour search only requires that
the distance to the element returned be within a factor of $1 + \varepsilon$ of
the distance to the "true" result:

Definition 1.7. Approximate Nearest Neighbour Search. Fix
$\varepsilon, \eta > 0$. Let

$$\rho_{NN}(q) = \inf\{\rho(q,x) \mid x \in X,\; x \neq q\}$$

represent the distance from q to its nearest neighbour in X. Then an
approximate nearest neighbour of q in X is any element x satisfying

$$\rho(q,x) \le (1 + \varepsilon)\,\rho_{NN}(q)$$

with probability at least $1 - \eta$. An approximate nearest neighbour search
query asks for any such x, with an a priori set confidence factor $\eta$.

There are some indications that approximate search is more efficient [Zez05].
However, as pointed out in e.g. [Lif07], [Bey99] and [Pes00], due to the
concentration that many spaces exhibit, almost all points in a typical high
dimensional dataset lie at about the same distance to q, so approximate search
is of limited usefulness. We will look closely at concentration in the next
chapter.

Chapter 2

The concentration of
measure phenomenon

High dimensional spaces pose a problem for algorithms: dimensionality appears
to affect [Don00] whole classes of algorithms in optimization, numerical
integration and database search. Cost estimates (running time) of solutions
depend on dimension exponentially, something that has come to be known as the
"curse of dimensionality". This term, although now liberally applied to any
problem that seems to depend exponentially on dimension, was first used in
[Bel61] to describe the unit hypercube $I^d$ in d dimensions.

This space exhibits, in a certain sense, a growing sparseness. That is, if we
were to take a "small hypercube" neighborhood around a point expecting to
capture a proportion r of the space $I^d$, the side length of the neighborhood
would have to be $r^{1/d}$ ([Fri01]). For a given proportion, say r = 1%, this
means a side length of 0.1 when d = 2, yet it grows to 0.79 when d = 20.
Meanwhile the side lengths of the whole space remain one.
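The arithmetic behind these numbers is a one-line computation. The helper
`side_length` below is our own naming, introduced only to make the example
self-contained.

```python
def side_length(proportion: float, d: int) -> float:
    """Side of a sub-cube of [0, 1]^d that captures the given proportion of volume."""
    if not 0.0 < proportion <= 1.0 or d < 1:
        raise ValueError("need 0 < proportion <= 1 and d >= 1")
    return proportion ** (1.0 / d)


# A neighborhood meant to capture 1% of the volume, in growing dimension.
for d in (2, 20, 200):
    print(d, round(side_length(0.01, d), 3))
```

Already in dimension 200 the "small" neighborhood must extend over almost the
whole range of every coordinate.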

Another way of looking at this effect is to compare the volume of the unit
ball and the unit hypercube. The volume of the unit hypercube is clearly 1,
while the ball's is (for even d, with $d!!$ the double factorial)

$$\frac{(2\pi)^{d/2}}{d!!}$$

a value that goes to 0 with increasing dimension. This observation is used to
argue that most of the points in the hypercube lie "near the edges".
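The decay of the volume of the unit ball is easy to tabulate. The sketch below
uses the general closed form $\pi^{d/2}/\Gamma(d/2+1)$ and, as an assumption
about which expression is intended above, checks it against the
double-factorial form $(2\pi)^{d/2}/d!!$, which agrees with it for even d. The
function names are ours.

```python
import math


def unit_ball_volume(d: int) -> float:
    """Volume of the Euclidean unit ball in R^d: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def double_factorial_form(d: int) -> float:
    """The same volume for even d, written as (2*pi)^(d/2) / d!!."""
    if d % 2 != 0 or d < 2:
        raise ValueError("this form is only used here for even d >= 2")
    d_double_factorial = math.prod(range(2, d + 1, 2))
    return (2 * math.pi) ** (d / 2) / d_double_factorial


for d in (2, 4, 10, 20, 50):
    print(d, unit_ball_volume(d))
```

The volume first increases with d and then decreases rapidly towards 0, while
the volume of the unit hypercube stays equal to 1.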

Yet another approach is to spread points uniformly and ask what is the average
distance between any given two. In the case of the hypercube it seems that
no closed form expression for this number exists, but an approximate expression
is $\sqrt{d/6}$ ([And76]). As there is nothing special about the centre except
that it is somewhat closer to the "average" point, this shows that the mean
distance between any two points grows at a rate of about $\sqrt{d}$ as a
function of dimension. This is a heuristic argument often quoted for the
hardness of function approximation in high dimensions ([Fri01]).

Figure 2.1: Simulation of Normalized Distance between two points in the
hypercube for d = 3, 30, 300



It is legitimate to observe that the diameter of the hypercube is exactly
$\sqrt{d}$, and thus a question that comes naturally to mind is why this should
be called a curse of dimensionality when perhaps a more appropriate description
is "curse of hugeness". After all, a rectangle with side lengths k and 1/k
exhibits a similar behaviour: the average distance between two points is about
k/3 [Dun97], which will go to infinity with k. Yet it remains a low dimensional
object that does not pose a problem for the aforementioned algorithms.

So is there some effect specific to high dimension? A simulation approach
is to generate (pseudo) random points on cubes of various dimensions and take
the resulting histograms as approximate probability densities. The result
presented in Figure 2.1 is a more nuanced view of the distribution of distances
in high dimensions: the histograms plotted show the distribution of distances
for various dimensions d, normalized by $\sqrt{d}$. The average normalized
distance tends to a constant, but it appears that the distribution is more
concentrated with increasing dimension. A plot of the empirical standard
deviation [She07] shows a decrease with dimension.

d       standard deviation

3       0.14
30      0.04
300     0.014
3000    0.004
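A simulation of this kind can be reproduced along the following lines. This is
a sketch under our own assumptions (uniform points in the unit hypercube,
distances normalized by $\sqrt{d}$, 500 pairs per dimension, a fixed seed); the
function name `normalized_distance_std` is hypothetical, and the estimates will
vary slightly with the sample.

```python
import math
import random
import statistics


def normalized_distance_std(d: int, pairs: int = 500, seed: int = 0) -> float:
    """Sample standard deviation of dist(x, y) / sqrt(d) for uniform x, y in [0, 1]^d."""
    rng = random.Random(seed)
    samples = []
    for _ in range(pairs):
        x = [rng.random() for _ in range(d)]
        y = [rng.random() for _ in range(d)]
        samples.append(math.dist(x, y) / math.sqrt(d))
    return statistics.stdev(samples)


s3 = normalized_distance_std(3)
s30 = normalized_distance_std(30)
s300 = normalized_distance_std(300)
print(round(s3, 3), round(s30, 3), round(s300, 3))
```

The estimated standard deviation shrinks roughly like $1/\sqrt{d}$, in line
with the table above.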



This illustrates what is sometimes called a "benefit of dimensionality" [Don00],
namely the concentration of measure phenomenon.

It is a well-studied topic of geometric analysis and is a much more powerful
statement than the one we have made about variances. As a preview we shall
take a look at one more set of pictures, of the high dimensional unit sphere.
In addition to being an appealing object it is naturally normalized with respect
to distance: the maximal distance between two points remains 2 no matter the
dimension.

Figure 2.2: Projection of uniform samples on spheres of various dimensions d

In order to draw a (2-D) picture of an object we have to find a way of
projecting it onto flat space. An orthodox choice would be to take any two
coordinates, say the first couple, and plot the resulting figure. Doing it for
several different pairs will give us different views, and thus maybe the whole
object will be known. When the d-dimensional sphere is sampled according
to the uniform measure and projected onto the plane by, say, taking the first
two coordinates, the result is rather peculiar: most points concentrate near the
centre of the image. A simulation for various values of d is provided in Figure
2.2.

This happens no matter which coordinates are chosen for the projection.

The 2-D picture is the same: a small core in the centre, with nearly nothing
around. If we were to attempt to calculate the diameter based on this picture
it would seem that the sphere, actually of constant diameter, is shrinking (with
sample size kept constant).

Although a bit more difficult to imagine, when taking a look at the equator of
the sphere, most points will lie a short distance from it. Again, it doesn't
matter if the equator is the standard one: any equator will have this
concentration around it.

The most convenient setting for finding "high-dimensional" objects is
$\mathbb{R}^d$, but the phenomenon of concentration is phrased in terms of
measure and distances, so it can be defined on metric spaces equipped with a
probability measure. As usual, whenever we take the measure of a subset of the
space we will restrict the discussion to measurable sets.

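Definition 2.1. Given a metric space $(\Omega,\rho)$ equipped with a
(probability) measure $\mu$, $A_\varepsilon$ is the $\varepsilon$-neighborhood
of $A \subset \Omega$, that is

$$A_\varepsilon = \{\omega \in \Omega \mid \rho(\omega,a) < \varepsilon \text{ for some } a \in A\}$$

We want to define a function $\alpha$ such that if $\mu(A) \ge 1/2$ then

$$\mu(A_\varepsilon) \ge 1 - \alpha(\varepsilon)$$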

In a sense we will pick the best such $\alpha$ and call it the concentration
function:

Definition 2.2. Given a metric space equipped with a (probability) measure,
$(\Omega,\rho,\mu)$, its concentration function
$\alpha = \alpha_{(\Omega,\rho,\mu)}$ is defined as

$$\alpha(0) = 1/2$$
$$\alpha(\varepsilon) = \sup\{1 - \mu(A_\varepsilon) \mid A \subset \Omega,\; \mu(A) \ge 1/2\}, \quad \varepsilon > 0$$

To put it less formally, we are trying to measure how much of the space
remains after "fat" is added to a somewhat large set in the form of an
$\varepsilon$-neighborhood. When very little remains, we say that the
concentration of measure takes place. Making the concept of "little" more
precise, normal concentration of measure is considered to be taking place when
C, c > 0 exist such that

$$\alpha(\varepsilon) < C e^{-c\varepsilon^2 d}$$

where d is the (intrinsic) dimension.

2.1 Examples 

Example 2.1. $\mathbb{R}^d$ with the Gaussian measure $\gamma$. The Gaussian
measure is defined on the completion of the Borel $\sigma$-algebra. It is the
generalization of the normal probability measure on $\mathbb{R}$. For any A in
the above-defined $\sigma$-algebra of measurable sets,

$$\gamma(A) = \frac{1}{(2\pi)^{d/2}} \int_A e^{-|x|^2/2} \, d\lambda^d(x)$$

where $\lambda^d$ is the d-dimensional Lebesgue measure.

The space $\mathbb{R}^d$ with the measure $\gamma$ and the standard Euclidean
metric coming from the $L_2$ norm produces a concentration function bounded as
follows:

$$\alpha(\varepsilon) \le e^{-\varepsilon^2/2}$$

This does not produce a normal concentration function. This is due to a certain
stretching of the space that occurs as d grows, something that is not desirable
from our perspective. In the upcoming example of Hamming cubes we will show
explicitly how a distance measure can be "properly" normalized so as to produce
a normal concentration function.

Figure 2.3: The concentration functions of various spheres

Example 2.2. The spheres $S^d$ in $\mathbb{R}^{d+1}$.

Taken with the geodesic or Euclidean distance and the normalized invariant
measure, they produce a family of concentration functions bounded as follows
[Led01]:

$$\alpha(\varepsilon) \le e^{-(d-1)\varepsilon^2/2}$$

In this case an exact expression for the concentration function is known
([Ben00], p. 282), as the half-sphere, of all subsets of measure at least 1/2,
will always produce the smallest $\varepsilon$-neighborhood, no matter the
$\varepsilon$. For the geodesic distance the measure of this neighborhood is
given by

$$\int_{-\pi/2}^{\varepsilon} \cos^{d-1} x \, dx \Big/ \int_{-\pi/2}^{\pi/2} \cos^{d-1} x \, dx$$

An estimate of this value can be arrived at via numerical integration. A plot
of the resulting concentration functions, for several values of d, appears in
Figure 2.3.
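The numerical integration can be sketched as follows. We assume here that the
sphere is $S^d \subset \mathbb{R}^{d+1}$ with the geodesic distance, that the
half-sphere is the extremal set, and that the latitude on the sphere has
density proportional to $\cos^{d-1}$; the composite midpoint rule and the
function names are our own choices and other quadrature rules would do
equally well.

```python
import math


def integrate(f, a: float, b: float, steps: int = 20000) -> float:
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def sphere_concentration(eps: float, d: int) -> float:
    """alpha(eps) for S^d: measure left outside the eps-neighbourhood of a half-sphere."""
    if eps >= math.pi / 2:
        return 0.0

    def density(t: float) -> float:
        # unnormalized density of the latitude t on S^d (assumption stated above)
        return math.cos(t) ** (d - 1)

    total = integrate(density, -math.pi / 2, math.pi / 2)
    outside = integrate(density, eps, math.pi / 2)
    return outside / total


for d in (3, 30, 300):
    print(d, sphere_concentration(0.3, d))
```

For a fixed $\varepsilon$ the computed values drop quickly as d grows, which is
the behaviour plotted in Figure 2.3.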

This example is particularly interesting as increasing dimension leads to
an increased concentration of measure.

Definition 2.3. A sequence of spaces $(\Omega_d)_{d=1}^{\infty}$ is a normal
Levy family [Mil86] if C, c > 0 exist such that

$$\alpha_d(\varepsilon) < C e^{-c\varepsilon^2 d}$$

where $\alpha_d$ is the concentration function of $\Omega_d$. Thus it is the
same notion of what is a tight concentration function as above.

Example 2.3. The Balls $B^d$. Taken with the Euclidean distance and the uniform
probability measure (d-dimensional Lebesgue), they form a normal Levy family.

Example 2.4. The Hamming Cubes $\Sigma^d$. The Hamming cubes, as defined in
Section 1.3, with the normalized distance and uniform measure form a normal
Levy family.



The concentration of measure can be equivalently described in terms of Lipschitz
functions.

Definition 2.4. A function $f : \Omega \to \mathbb{R}$ is 1-Lipschitz if

$$\forall x, y \in \Omega, \quad |f(x) - f(y)| \le \rho(x,y)$$

In general a function f is p-Lipschitz if for all x and y,
$|f(x) - f(y)| \le p\,\rho(x,y)$.

We note that $\rho(q,\cdot)$, the function assigning to $\omega$ its distance
to q, is 1-Lipschitz. In spaces that have a tight concentration function
$\alpha$, Lipschitz functions will be nearly constant, and one candidate for
this constant is a median value.

Definition 2.5. A median of a function $f : (\Omega,\mu) \to \mathbb{R}$ is any
number M satisfying:

$$\mu\{\omega \mid f(\omega) \le M\} \ge 1/2 \quad\text{and}\quad \mu\{\omega \mid f(\omega) \ge M\} \ge 1/2$$

This is a slight generalization of the usual concept of the median of a set of
numbers, only no attempt is made to make it unique. Discrete functions may very
well have multiple valid values for M.

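Theorem 2.6. For a 1-Lipschitz function f defined on the space
$(\Omega,\mu,\rho)$, with median M:

$$\forall \varepsilon > 0, \quad \mu\{\omega \mid |f(\omega) - M| > \varepsilon\} \le 2\alpha(\varepsilon)$$

Proof. Fix $\varepsilon > 0$. Set

$$A = \{\omega \mid f(\omega) \le M\} \quad\text{and}\quad B = \{\omega \mid f(\omega) \le M + \varepsilon\}$$

then

$$A_\varepsilon = \{\omega \mid \rho(\omega,a) < \varepsilon \text{ for some } a \in A\}$$
$$\subset \{\omega \mid f(\omega) - f(a) \le \varepsilon \text{ for some } a \in A\}$$
$$\subset B$$

Then, since $\mu(A) \ge 1/2$, by definition of $\alpha$ we have

$$\mu(B) \ge \mu(A_\varepsilon) \ge 1 - \alpha(\varepsilon)$$

The same argument can be applied to

$$A' = \{\omega \mid f(\omega) \ge M\} \quad\text{and}\quad B' = \{\omega \mid f(\omega) \ge M - \varepsilon\}$$

Since the probability of the entire space is 1, we rework the well-known

$$\mu(B \cup B') = \mu(B) + \mu(B') - \mu(B \cap B') \le 1$$

into

$$\mu(B \cap B') \ge \mu(B) + \mu(B') - 1$$

thus

$$\mu(B \cap B') \ge 1 - \alpha(\varepsilon) + 1 - \alpha(\varepsilon) - 1 = 1 - 2\alpha(\varepsilon)$$

As $B \cap B' = \{\omega \mid |f(\omega) - M| \le \varepsilon\}$, the claim
follows. □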

Lipschitz functions allow us to formulate the concept of observable diameter,
as illustrated by Figure 2.2, more rigorously.

Definition 2.7. Let $\kappa > 0$ be fixed. The $\kappa$-observable diameter of
$(\Omega,\rho,\mu)$, denoted $\text{obs-diam}_\kappa \Omega$, is defined as
($M_f$ is the median of f):

$$\text{obs-diam}_\kappa \Omega = 2 \inf\{D > 0 \mid \forall\, 1\text{-Lipschitz } f,\; \mu\{\omega \mid |f(\omega) - M_f| \ge D\} \le \kappa\}$$

We could reformulate the concentration of measure in terms of how fast the
observable diameter shrinks to 0.

Actually calculating the exponential expressions for $\alpha$ is much harder:
the sometimes complicated derivations can be found in [Led01] and [Mil86].
However, certain details of such proofs are important for understanding the
role that normalization of distance plays. In spaces where the diameter is
allowed to grow, e.g. cubes of dimension d as at the beginning of the chapter
or Hamming cubes with the non-normalized distance

$$\rho(x,y) = \sum_{i=1}^{d} |x_i - y_i|$$

the concentration function may end up with a bound of the form:

$$\alpha(\varepsilon) < C e^{-c\varepsilon^2}$$

that is, without dependence on d.

For purposes of similarity search, however, what matters is not the absolute
value of the distance, but the proportion in relation to the space. That is, the
issue is not so much whether the range query is of radius 10 or 0.1 but what
proportion of the space falls in a ball with this radius. So in an asymptotic
analysis the aim is to keep the radius constant as dimension is growing; in
"reasonable" spaces this is accomplished through normalizing by the diameter.
In general spaces another normalization, perhaps utilizing the expected
distance between two points, is required.

We can illustrate the normalization by looking at two versions of the "Blowing
Up Lemma" (e.g. [Pes07b]) for the Hamming cubes:

Theorem 2.8. For a 1-Lipschitz function $f : \Sigma^d \to \mathbb{R}$ with
respect to the non-normalized Hamming distance, we have that

$$\forall \varepsilon > 0, \quad \mu\{\omega \mid |f(\omega) - E(f)| > \varepsilon\} \le 2e^{-2\varepsilon^2/d}$$

If however the function is 1-Lipschitz w.r.t. the normalized Hamming distance,
we have

$$\forall \varepsilon > 0, \quad \mu\{\omega \mid |f(\omega) - E(f)| > \varepsilon\} \le 2e^{-2\varepsilon^2 d}$$

The second part follows from the first, since a function f that is 1-Lipschitz
w.r.t. the normalized Hamming distance is 1/d-Lipschitz w.r.t. the
non-normalized one. Therefore df is 1-Lipschitz w.r.t. the non-normalized
distance and so

$$\forall \varepsilon > 0, \quad \mu\{\omega \mid |df(\omega) - dE(f)| > d\varepsilon\} \le 2e^{-2(d\varepsilon)^2/d} = 2e^{-2\varepsilon^2 d}$$

We are making the broad claim that the exponential decrease in d is a general
phenomenon for spaces that are properly normalized for similarity search. So if
no such dependence on d is observed, as in Example 2.1, it may be just a matter
of a bad choice of distance.
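The normalized version can be checked exactly for one particular 1-Lipschitz
function. In the sketch below we take f to be the normalized Hamming distance
to the all-zero string, whose expectation is 1/2, and we assume that the bound
has the Hoeffding form $2e^{-2\varepsilon^2 d}$; the constants of the theorem
as originally stated may differ. The function names are ours.

```python
import math


def deviation_probability(d: int, eps: float) -> float:
    """Exact mu{ |f - 1/2| > eps } for f = normalized distance to the zero string."""
    favourable = sum(math.comb(d, k) for k in range(d + 1)
                     if abs(k / d - 0.5) > eps)
    return favourable / 2 ** d


def normalized_bound(d: int, eps: float) -> float:
    """The assumed bound 2 exp(-2 eps^2 d) for the normalized Hamming distance."""
    return 2 * math.exp(-2 * eps ** 2 * d)


for d in (10, 100, 1000):
    print(d, deviation_probability(d, 0.1), normalized_bound(d, 0.1))
```

Both the exact probability and the bound decrease exponentially in d for a
fixed $\varepsilon$, which is the dependence on dimension that the
normalization is meant to expose.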

2.2 Link to concentration of measure 

We would like to demonstrate why in so many familiar spaces indexing, and
in particular indexing using pivots, is impossible. The use of concentration of
measure in indexing is noted in [Pes00]. It relies on the observation that

$$\rho(\cdot,p) : \Omega \to \mathbb{R} : \omega \mapsto \rho(\omega,p)$$

is 1-Lipschitz for any p and in particular for a pivot. Hence Theorem 2.6 can be
applied to obtain a bound on the deviation from the median $M = M_p$ of this
function:

$$\forall r > 0, \quad \mu\{\omega \mid |\rho(\omega,p) - M| > r\} \le 2\alpha(r)$$

Since p is a general element of $\Omega$, the statement holds individually for
each pivot $p_i$. We combine these statements:

$$\forall r > 0, \quad \mu\{\omega \mid \sup_{1 \le i \le k} |\rho(\omega,p_i) - M_i| > r/2\} \le 2k\alpha(r/2)$$

as the probability of the union can always be upper-bounded, if roughly, by the
sum of the probabilities. We note that no assumptions about independence are
used: the sequence $(p_i)$ can be chosen in any way. We used r/2 so as to get
rid of the $M_i$: for a query centre q whose distances to the pivots are
themselves within r/2 of the medians $M_i$,

$$\forall r > 0, \quad \mu\{\omega \mid \sup_{1 \le i \le k} |\rho(\omega,p_i) - \rho(q,p_i)| > r\} \le 2k\alpha(r/2)$$
Comparing this to the definition of C:

$$C = \{x \in X \mid \rho_k(q,x) > r\}$$

it is apparent that the only difference between the set we are upper-bounding
and C is that one is defined over all of $\Omega$ and the other just over X. We
could introduce a set

$$\mathcal{C} = \{\omega \in \Omega \mid \rho_k(q,\omega) > r\}$$

and think of $C = \mathcal{C} \cap X$ as the observation of $\mathcal{C}$ under
X. To restate the upper bound in terms of $\mathcal{C}$,

$$\forall r > 0, \quad \mu(\mathcal{C}) \le 2k\alpha(r/2)$$

So in effect, assuming that $2k\alpha(r/2)$ is small, we have (roughly):

$$\mu(\mathcal{C}) \approx 0 \qquad (2.1)$$

This is along the lines of [Pes00], yet we would like to find out what happens
to C. The point here is that if something happens in $\Omega$, it will not
necessarily hold mutatis mutandis in the dataset X. The statement

$$\mu_\#(C) \approx 0 \qquad (2.2)$$

where $\mu_\#$ is the normalized counting measure on X, describes a situation
in which much of the dataset X needs to be totally searched. The more
theoretical statement (2.1) refers to a situation where the query searches over
all the elements of $\Omega$, in effect forcing $X = \Omega$, something we
would like to avoid.

The probability measure $\mu_\#$ is a function of n, the size of the dataset. In
fact the dataset is an i.i.d. sample of size n from the probability metric space
$(\Omega,\rho,\mu)$ and hence $\mu_\# = \mu_n$ is also a random variable. Such
complications show that it is important to fully describe the variables that
underlie the apparently simple statements.

If we let q be fixed, equation (2.1) in effect states

$$\mu(\mathcal{C}_{q,p_1 \ldots p_{k(n)},r(n)}) \to 0 \quad\text{as } n, d \to \infty$$

where n, d are related as described in Section 1.3 and

$$\mathcal{C}_{q,p_1 \ldots p_{k(n)},r(n)} = \{\omega \in \Omega \mid \rho_k(q,\omega) > r\}$$

like before, with all variables made explicit. Similarly (2.2) states

$$\mu_n(\mathcal{C}_{q,p_1 \ldots p_{k(n)},r(n)}) \xrightarrow{p} 0 \quad\text{as } n, d \to \infty$$

where convergence in probability is based on the sample, also known as our
dataset X of size n.

This situation is further complicated by considering the query centre q, the
pivots $p_i$, the radius r and even k to be random variables. Here we will
allow q to vary over all of $\Omega$, while r can be thought of as proportional
to the distance from q to its nearest neighbour in X. In the next section we
will see that the median nearest neighbour distance in our properly normalized
Levy families is bounded away from 0 asymptotically.



2.3 Radius of queries in Levy families 

We have described above how we would like to normalize spaces so that the
"average" distance between two points stays about the same. Here we will show
why this also implies that the typical radius of a query, which we will here
assume to be the distance to the nearest neighbour of the query centre, also
behaves nicely.

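Lemma 2.9 (M. Gromov, V.D. Milman). [Gro83] Let $(\Omega,\rho,\mu)$ denote a
metric space with measure and $\alpha$ its concentration function. If
$A \subset \Omega$ is such that $\mu(A) > \alpha(\gamma)$ for some
$\gamma > 0$, then $\mu(A_\gamma) > 1/2$.

Proof. Assume not and let $B = \Omega \setminus A_\gamma$. Then
$\mu(B) \ge 1/2$, which implies $\mu(\Omega \setminus B_\gamma) \le \alpha(\gamma)$.
But A is disjoint from $B_\gamma$, so
$\mu(A) \le \mu(\Omega \setminus B_\gamma) \le \alpha(\gamma)$, a
contradiction. □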

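Theorem 2.10. Let $(\Omega_d, \rho_d, \mu_d, X_d)_{d=1}^{\infty}$ be a sequence
of metric spaces with measure, forming a normal Levy family, together with
i.i.d. samples $X_d$ from $\Omega_d$. Assume that
$n = n_d = |X_d| = d^{O(1)}$. Furthermore, if $M_d$ denotes the median value of
$\{\rho_d(\omega_1,\omega_2) \mid \omega_i \in \Omega_d\}$, we assume that
$M_d = \Theta(1)$, that is, for some fixed $c_1, c_2 > 0$,

$$\forall d, \quad c_1 < M_d < c_2.$$

Let $\rho^{d}_{NN}(\omega)$ denote the distance to the nearest neighbour of
$\omega \in \Omega_d$ in $X_d$. Define $m_d$ to be the median of
$\rho^{d}_{NN}(\omega)$. Then there exist some $c_3 > 0$ and some D such that
$\forall d > D$, $m_d > c_3$.

Proof. Assume the conclusion fails; then, without loss of generality and
proceeding to a subsequence if necessary, $m_d \to 0$. By definition of $m_d$,
we know that for any d,

$$\mu_d\{\omega \mid \rho^{d}_{NN}(\omega) \le m_d\} = \mu_d\Big(\bigcup_{x \in X_d} B_{m_d}(x)\Big) \ge \frac{1}{2}$$

It follows that

$$n_d \sup_{\omega \in \Omega_d} \mu_d(B_{m_d}(\omega)) \ge \frac{1}{2}$$

and so we can find for any d a point $\omega_d \in \Omega_d$ such that

$$\mu_d(B_{m_d}(\omega_d)) \ge \frac{1}{2n_d}$$

If we denote by $\alpha_d$ the concentration functions of our spaces
$\Omega_d$, we know by assumption the existence of C, c > 0 s.t.

$$\forall d, \quad \alpha_d(\varepsilon) < C e^{-c\varepsilon^2 d}$$

Hence we can find d' s.t. $\alpha_{d'}(\gamma) < 1/2n_{d'}$ and
$m_{d'} < c_1/8$, where $\gamma = c_1/8$ as well: this is since eventually,

$$C e^{-c\gamma^2 d} < \frac{1}{2d^{O(1)}} = \frac{1}{2n_d}$$

Then by Lemma 2.9

$$\mu_{d'}(B_{m_{d'}+\gamma}(\omega_{d'})) > \frac{1}{2}$$

It then follows that

$$\mu_{d'}(B_{m_{d'}+2\gamma}(\omega_{d'})) \ge 1 - \alpha_{d'}(\gamma) > 1 - C e^{-c\gamma^2 d'}$$

that is, since $m_{d'} + 2\gamma < 3c_1/8$,

$$\mu_{d'}(B_{3c_1/8}(\omega_{d'})) > 1 - C e^{-c\gamma^2 d'}$$

But

$$\text{diameter}(B_{3c_1/8}(\omega_{d'})) \le \frac{3c_1}{4}$$

So in $\Omega_{d'} \times \Omega_{d'}$ the measure of the set of points
$(\omega_1,\omega_2)$ for which $\rho_{d'}(\omega_1,\omega_2) < c_1$ is at
least

$$\Big(1 - \frac{1}{2n_{d'}}\Big)^2$$

obviously contradicting $M_{d'} > c_1$.

□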

This result frees us from having to consider a radius that vanishes to 0 as n,
d go to infinity. With this achieved, let us recap our goal: to show that most
queries are slow, i.e. that what we casually referred to as equation (2.2) takes
place for most queries. What we in fact want is something along the lines of:

$$\text{median}_{q,p_i,r}\big(\mu_n(\mathcal{C}_{q,p_1 \ldots p_{k(n)},r(n)})\big) \to 0 \quad\text{as } n, d \to \infty, \qquad (2.3)$$

where the median is taken over all the queries under consideration: any
$q \in \Omega_d$ and any r at least as large as the distance to the nearest
neighbour of q in X. As well, for each d and $n = n_d$ we would like to also
consider all possible pivot-based index schemes (as long as k is within certain
ranges we will specify later). Why the median? The aim is to show a certain
behaviour for many queries: at least half is dramatic enough. So far we have
shown, although the proof was just sketched (and with the detail about k left
out), that

$$\text{median}_{q,p_i,r}\big(\mu(\mathcal{C}_{q,p_1 \ldots p_{k(n)},r(n)})\big) \to 0 \quad\text{as } n, d \to \infty \qquad (2.4)$$

which is fine as long as $X = \Omega_d$ and hence the selection of a small
dataset from an underlying large space is not taken into account. A more likely
situation, however, is that of a finite $X = X_d$ and an infinite (or at least
much larger) $\Omega = \Omega_d$. What we need is to find out when statement
(2.4) implies (2.3). To do so we will summon the powerful machinery of
statistical learning theory.

Chapter 3

Statistical learning theory



Statistical learning theory has already been used in the analysis and design of in- 
dexing algorithms [Kle97] and is a vast subject. Instead of concerning ourselves 
with the whole, we will just focus on an important part: the generalization of 
the Glivenko-Cantelli theorem due to Vapnik and Chervonenkis. 

In keeping with previous notation, X or (X) will denote a sample of n points
$X_1, X_2, \ldots, X_n$, sampled according to some unknown probability measure
$\mu$ from a space $\Omega$. The sampling is independent, thus in statistical
jargon X is an i.i.d. sample. Although $\mu$ is unknown, we can obtain useful
approximations through X. In particular we can create a measure on the space
$\Omega$ induced by X, the normalized counting measure $\mu_n = \mu_n(X)$:

$$\mu_n(A) = \frac{|A \cap X|}{|X|}$$

In Section 1.1 we also referred to it as $\mu_\#$. We will use the subscript n
to emphasize the sample size and that in effect we are dealing with a whole
class of measures. As they depend on the sample, these measures are also random
in the sense that they are random variables.

To talk of a random variable approximating another one, it is necessary to
describe what convergence of random variables is.

Definition 3.1 (Convergence of random variables). It is said that a sequence
of random variables $(Y_n)$ converges to the random variable Y in probability
if

$$\forall \varepsilon > 0, \quad \lim_{n \to \infty} P(|Y_n - Y| > \varepsilon) = 0$$

We denote convergence in probability by $Y_n \xrightarrow{p} Y$. Almost
everywhere convergence takes place if a stricter condition is met:

$$P\big(\lim_{n \to \infty} Y_n = Y\big) = 1$$

It is denoted by

$$Y_n \xrightarrow{a.e.} Y$$

It is not too hard to see that a.e. convergence implies convergence in
probability. In fact the well known Law of Large Numbers has two versions, one
for each type of convergence. We present what is in a sense just a variation on
this law, the Glivenko-Cantelli theorem.

Theorem 3.2 (Glivenko-Cantelli). Given a sample $(X) = X_1, X_2, \ldots, X_n$
distributed i.i.d. according to any measure $\mu$ on $\mathbb{R}$ we have:

$$\sup_{r \in \mathbb{R}} \left| \mu_n(-\infty, r] - \mu(-\infty, r] \right| \xrightarrow{P} 0$$

The convergence is taken with respect to the product measure induced by the
sample. This theorem provides a means of linking the empirical distribution

$$F_n(r) := \mu_n(-\infty, r]$$

and the actual distribution

$$F(r) := \mu(-\infty, r].$$

To paraphrase, it tells us that the empirical distribution converges "uniformly
in probability" to the actual one. Incidentally, the almost sure convergence also
takes place.

We can also see this statement in terms of the empirical measures of par- 
ticular subsets converging to their true measure. This is made clear when we 
restate the theorem as follows: 

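$$\sup_{A \in \mathcal{A}} \left| \mu_n(A) - \mu(A) \right| \xrightarrow{P} 0 \tag{3.1}$$

where

$$\mathcal{A} = \{(-\infty, r] \mid r \in \mathbb{R}\}$$

which makes more apparent a path for extension: to generalize to other
collections $\mathcal{A}$, on spaces $(\Omega, \mu)$ other than the real line.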

The question of when (3.1) holds is not trivial, as for a general $\mu$ and $\mathcal{A}$
the answer is not always positive:

Example 3.1. Let

$$\mathcal{A} = \{\text{countable unions of open intervals in } [0, 1]\}$$

that is, the collection of sets which are countable unions of open intervals,
equipped with the inherited Lebesgue measure $\mu$. Then for any sample $(X) =
X_1, X_2, \ldots, X_n$ of points in $[0, 1]$, we can find $A \in \mathcal{A}$ consisting of, say, small
disjoint intervals around the $X_i$ so that $\mu(A) < 0.1$ and $\mu_n(A) = 1$ (by virtue of
containing all the $X_i$). This results in $P\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > 0.8\right) = 1$
for any $n$ and hence no convergence to zero can take place.






The required convergence fails to take place because the class $\mathcal{A}$ is too rich.
Of course it is also possible that with a different measure convergence will take
place, but in a setting where $\mu$ is unknown it is best to make few if any
assumptions about it. Extensions of (3.1), with a focus on finding the correct
restrictions of $\mathcal{A}$, are a topic explored in [Vap98]. The bulk of the work is in
finding appropriate measures of "size" of collections that can determine if (3.1)
takes place, and if so at what rate.

To that purpose, we connect the potentially infinite collection A with the 
finite sample: 

Given $(X) = X_1, X_2, \ldots, X_n$ the collection $\mathcal{A}$ "colours" the sample as follows.
Each $A \in \mathcal{A}$ will assign 1 to $X_i$ if $X_i \in A$; otherwise it assigns 0. Hence we get
a colouring of type

0,1,0,0,1,0

which is an $n$-length encoding that might as well have been

white-black-white-white-black-white

We denote by $N(X)$ the number of such different encodings, when all $A \in \mathcal{A}$
are used to colour $X$. Clearly $N(X) \le 2^n$. What is surprising is that in many
situations, despite a seemingly rich $\mathcal{A}$, we have $N(X) \ll 2^n$.
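As an illustration of how small $N(X)$ can be, the following minimal Python sketch counts the colourings that half-lines $(-\infty, r]$ induce on a small sample. The function names and the sample values are assumptions made for this example only; the count is done by brute force over a finite set of thresholds that is enough to realize every half-line colouring of the sample.

```python
def colourings(sample, membership_tests):
    """Return the set of distinct 0/1 colourings of `sample`.

    Each element of `membership_tests` is a predicate x -> bool standing
    for one set A of the collection (hypothetical representation).
    """
    return {tuple(int(test(x)) for x in sample) for test in membership_tests}


def half_line(r):
    """Membership test for the half-line (-inf, r]."""
    return lambda x: x <= r


sample = [0.3, 1.2, 2.5, 4.0]
# One threshold below every point, plus one threshold at each sample point,
# realizes every colouring that any half-line can produce on this sample.
thresholds = [min(sample) - 1.0] + sorted(sample)
half_lines = [half_line(r) for r in thresholds]

n_colourings = len(colourings(sample, half_lines))
print(n_colourings, "colourings out of a possible", 2 ** len(sample))
```

For a sample of $n$ distinct reals, half-lines can only produce the colourings "the $j$ smallest points are 1, the rest are 0", so the count grows linearly in $n$ rather than like $2^n$.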

Definition 3.3. The (random) entropy of sample $X = X^{(n)}$ is just $\ln N(X)$,
denoted by $H(X)$. It follows that $H(X) \le n \ln 2$. The expected value of $H(X)$,
w.r.t. the sample distribution (in effect a product of $\mu$'s) is called the entropy
of size $n$, denoted by $H(n)$:

$$H(n) = \mathbb{E}\left[\ln N(X)\right]$$

A result in [Vap98] states that (3.1) is equivalent to

$$\frac{H(n)}{n} \xrightarrow{n \to \infty} 0$$

This however is of little use if $\mu$ is unknown and does not guarantee fast (i.e.
exponential) convergence.

Definition 3.4. The growth function is defined by

$$G(n) = \ln \sup_{X} N(X)$$

where the supremum is taken over all samples of size $n \ge 1$. It is independent
of $\mu$ and choice of sample $X$.

There are two cases to consider for an upper bound for the growth function
[Vap98]:

• for all $n$, $G(n) = n \ln 2$

• or, for the largest $\Delta$ such that $G(\Delta) = \Delta \ln 2$,

$$G(n) \begin{cases} = n \ln 2 & \text{if } n \le \Delta \\ \le \Delta\left(1 + \ln(n/\Delta)\right) & \text{if } n > \Delta \end{cases}$$

This says that $G$ can be either linear in $n$ or logarithmic after some point
$\Delta$. This $\Delta$ is the so-called VC dimension and it turns out that its existence
is precisely a necessary and sufficient condition for (3.1). Terminology in the
literature also deems existence of $\Delta$ the case of finite VC dimension, while if no
such number exists $\mathcal{A}$ is said to be of infinite VC dimension.

So the challenge of going from (2.4) to (2.3) can be reduced to the calculation
of this VC dimension for some $\mathcal{A}$. Since equations (2.4) and (2.3) are statements
about sets $C = C_{q,p_1,\ldots,p_{k(n)},r(n)}$, we would like to estimate the VC dimension of
the collection of all sets of this form, for fixed $k$:

$$\mathcal{A} = \mathcal{A}_k = \{C_{q,p_1,\ldots,p_{k(n)},r(n)} \mid q \in \Omega,\ p_i \in \Omega,\ r > 0\} \tag{3.2}$$

Unfortunately this is a far from trivial exercise as it involves figuring out what
possible forms all of these sets can take.

3.1 Calculating a VC dimension

To show an easy example of calculating a VC dimension, we will come back to
the $\mathcal{A}$ from Glivenko-Cantelli:

$$\mathcal{A} = \{(-\infty, r] \mid r \in \mathbb{R}\}$$

The calculation in this case can be done with "one's bare hands":
For $X = X_1 \in \mathbb{R}$, a sample consisting of a single observation, only two colourings
have to be realized. The element $(-\infty, X_1 - 1] \in \mathcal{A}$ will paint $X_1$ as 0, while
$(-\infty, X_1]$ will paint it as 1.

This elementary observation can be phrased as "any sample of size 1 is
shattered by the class $\mathcal{A}$".

However if one takes a two element sample where without loss of generality
$X_1 < X_2$ it is quite clear that no element of type $(-\infty, r]$ will contain $X_2$ but
not $X_1$. In other words the colouring 0, 1 cannot be realized.

Therefore if we denote an $n$-size sample by $X^{(n)}$ we have that $N(X^{(1)}) = 2$
and $N(X^{(2)}) < 4$ for any choice of samples. The biggest size of the shattered
sample is then 1, i.e. the VC dimension of $\mathcal{A}$ is 1.

More clever arguments (e.g. [Dud84], [Vap98], [Pes02]) are needed to prove
that:

• The VC dimension of half-spaces in $\mathbb{R}^d$ is $d + 1$. Recall that a general
hyperplane in $\mathbb{R}^d$ is defined as

$$\{x \in \mathbb{R}^d \mid \langle x, v \rangle = b\}$$

for some $v \in \mathbb{R}^d$, $b \in \mathbb{R}$. Hence a general half-space is of the form

$$\{x \in \mathbb{R}^d \mid \langle x, v \rangle > b\}$$

with the other "half" specified by multiplying $v$ and $b$ by $-1$.

• The VC dimension of all open (or closed) balls in $\mathbb{R}^d$,

$$\{\{x \in \mathbb{R}^d \mid \|x - v\| < r\}, \text{ where } v \in \mathbb{R}^d,\ r \in \mathbb{R}\}$$

is also $d + 1$.

• Axis-aligned rectangular parallelepipeds in $\mathbb{R}^d$, i.e. sets of the form

$$[a_1, b_1] \times [a_2, b_2] \times \ldots \times [a_d, b_d]$$

have a VC dimension of $2d$ [Dev97].
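The bare-hands argument for half-lines, and the $2d$ value for boxes in the special case $d = 1$ (closed intervals), can be checked by brute force. The sketch below is only an illustration under stated assumptions: it searches for shattered subsets of a small finite grid, using a finite family of thresholds that is sufficient for that grid. In general such a search gives only a lower bound on the VC dimension, although for these two classes it recovers the known values. All names are hypothetical.

```python
from itertools import combinations


def is_shattered(points, membership_tests):
    """True if the tests realize all 2^m colourings of the m given points."""
    patterns = {tuple(test(x) for x in points) for test in membership_tests}
    return len(patterns) == 2 ** len(points)


def vc_dimension_on_grid(grid, membership_tests, max_size):
    """Largest m <= max_size such that some m-subset of `grid` is shattered.

    Stopping at the first size with no shattered subset is safe because
    every subset of a shattered set is itself shattered.
    """
    best = 0
    for m in range(1, max_size + 1):
        if any(is_shattered(s, membership_tests) for s in combinations(grid, m)):
            best = m
        else:
            break
    return best


grid = [0, 1, 2, 3, 4]
# Cut points lying strictly between (and outside) the grid points.
cuts = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5]

# Half-lines (-inf, r] and closed intervals [a, b] with endpoints at the cuts.
half_lines = [lambda x, r=r: x <= r for r in cuts]
intervals = [lambda x, a=a, b=b: a <= x <= b for a in cuts for b in cuts if a <= b]

print(vc_dimension_on_grid(grid, half_lines, max_size=3))
print(vc_dimension_on_grid(grid, intervals, max_size=3))
```

The default arguments `r=r`, `a=a`, `b=b` bind each threshold at definition time, so every lambda stands for a different set.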

Given the existence of these results for $\mathbb{R}^d$, it is relatively easy to attempt to
calculate the VC dimension of $\mathcal{A}_k$ from the previous section.
First note that

$$C = \{\omega : \sup_i |\rho(q, p_i) - \rho(\omega, p_i)| > r\} = \left(\bigcap_i \{\omega : |\rho(q, p_i) - \rho(\omega, p_i)| \le r\}\right)^c$$

This allows us to proceed through several simple steps:

• A set of the form

$$\{\omega : \big|\, \|q - p_i\| - \|\omega - p_i\| \,\big| \le r\}$$

is a spherical shell, i.e. the intersection of one ball with the complement of
a smaller ball having the same center. A coconut shell comes to mind as a
physical example. We note that an intersection of shells is an intersection
of sets from $\mathcal{A} \cup \mathcal{A}^c$ where $\mathcal{A}$ is the collection of all balls.

• Given a collection $\mathcal{A}$ the complement collection

$$\mathcal{A}^c = \{A^c \mid A \in \mathcal{A}\}$$

has the same VC dimension. For assume $\mathcal{A}$ shatters $X$ and take any
colouring of $X$. The opposite colouring, obtained by putting 1 instead of 0 and 0
instead of 1, is produced by some $A \in \mathcal{A}$. Then $A^c \in \mathcal{A}^c$ produces
the original colouring. The same argument can be applied in the other
direction.

• The VC dimension of balls was quoted above as $d + 1$, hence the VC
dimension of complements of balls is $d + 1$ as well. The VC dimension of
the union of the two collections is at most

$$(d + 1) + (d + 1) + 1 = 2d + 3$$

This is a consequence of a general result [Vid03]:

Lemma 3.5. If a collection $\mathcal{A}$ has VC dimension $\Delta_A$ and a collection $\mathcal{B}$ has
VC dimension $\Delta_B$ (i.e. both are finite), the union $\mathcal{A} \cup \mathcal{B}$ has VC dimension at
most $\Delta_A + \Delta_B + 1$.

We note that the above is for a union of the two collections. Another result,
this time for intersections of sets, is mentioned in [Blu89]:

Lemma 3.6. For $(\Omega, \rho) = (\mathbb{R}^d, L^2)$, an upper bound on the VC dimension of
$\mathcal{A}_{\cap k}$, composed of $k$-fold intersections of elements of $\mathcal{A}$ of VC dimension $\Delta$, is

$$2 \Delta k \ln(3k)$$

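Each of the $k$ shells is an intersection of two sets from $\mathcal{A} \cup \mathcal{A}^c$, so we are
dealing with $2k$-fold intersections. Hence at last we can conclude that the VC
dimension of $\mathcal{A}_k$ for the case $\Omega \subset \mathbb{R}^d$ is bounded by

$$2(2d + 3)(2k) \ln(3 \cdot 2k) = k(8d + 12) \ln(6k) \tag{3.3}$$

where $k$ is the number of pivots. This also allows us to conclude that convergence
like in equation (3.1) takes place. Of course we only considered the case of $\mathbb{R}^d$
with the normal Euclidean metric.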

We have already mentioned that the VC dimension of axis-aligned "boxes"
is $2d$. It so happens that all balls with respect to the $L^\infty$ metric are such boxes,
so we can obtain a bound on $\mathcal{A}_k$ for this metric as well.

Another example comes from considering the Hamming cube. In a similar
argument to section 1.3, we observe that there are $2^d$ points in a $d$-dimensional
Hamming cube, and at most $d$ different radii, so at most $d 2^d$ different balls
exist. We know from e.g. [Blu89] an upper bound on the VC dimension of finite
collections:

Lemma 3.7 (Finite $\mathcal{A}$). If the class $\mathcal{A}$ is finite, its VC dimension is bounded
by $\log_2 |\mathcal{A}|$.

Here $\log_2(d 2^d) = d + \log_2 d$, so disregarding the small leftover term, the VC
dimension for balls in the Hamming cube is about $d$.

Summarizing our three examples: 

Theorem 3.8. Let us denote by $\Delta$ the VC dimension of collection $\mathcal{A}_k$ as defined
in equation (3.2). Then various upper bounds on $\Delta$, depending on the space,
are as follows:

• For $(\mathbb{R}^d, L^2)$, $\Delta \le k(8d + 12) \ln(6k)$

• For $(\mathbb{R}^d, L^\infty)$, $\Delta \le k(16d + 4) \ln(6k)$

• For $(\Sigma^d, \rho)$ (normalized or not), $\Delta \le k(8d + 8 \log_2 d + 4) \ln(6k)$

The case of the general metric space is harder, for the VC dimension of balls is
dependent on the "intrinsic" dimension of the space. In [Bou] various capacity
measures are calculated for general balls in general metric spaces. Crucially,
the capacity measures are stated in terms of the covering numbers of the space
which in turn relate to the Assouad dimension discussed above. What
is lacking is a computationally feasible procedure for estimating this dimension
for an arbitrary $\Omega$, through a random sample $X$. This underscores why the
Chávez intrinsic dimension is so attractive, as its easy formulation allows us
to estimate it accurately with moderate sample sizes (using for example the
Hoeffding inequality in section 3.2). What is lacking however are structural results
either linking capacity measures to the Chávez intrinsic dimension directly or
somehow relating it to the Assouad dimension.

Even if we are willing to sacrifice the generality of our $\Omega$ and stick to $\mathbb{R}^d$
(after all a lot of spaces can be made to sit, even if unnaturally, in some $\mathbb{R}^d$)
a problem remains. A finite VC dimension gives us convergence, but nothing
about the rate of convergence has been said so far. A worrying sign comes from
trying out some reasonable values for $d$ and $k$, e.g. $d = 20$, $k = 50$. The bound
on VC dimension then becomes 49,053, a much larger number than, say, the
VC dimension of spheres of the underlying Euclidean space. How large does $n$
have to be for the convergence result to have any value?
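The three bounds of Theorem 3.8 all follow the same recipe: take the VC dimension of balls, apply Lemma 3.5 to the union of balls and their complements, then apply Lemma 3.6 to $2k$-fold intersections. The following Python sketch spells this out and evaluates the Euclidean bound at $d = 20$, $k = 50$. The function names are assumptions of this example, and Lemma 3.6 is applied outside the Euclidean case exactly as the text does.

```python
import math


def union_bound(delta_a, delta_b):
    """Lemma 3.5: VC dimension of the union of two collections."""
    return delta_a + delta_b + 1


def intersection_bound(delta, k):
    """Lemma 3.6: bound for k-fold intersections of sets from a collection."""
    return 2 * delta * k * math.log(3 * k)


def pivot_collection_bound(ball_vc, k):
    """Bound on the VC dimension of A_k given the VC dimension of balls."""
    balls_and_complements = union_bound(ball_vc, ball_vc)
    # Each of the k shells is an intersection of two sets: 2k-fold intersections.
    return intersection_bound(balls_and_complements, 2 * k)


def euclidean_bound(d, k):
    return pivot_collection_bound(d + 1, k)          # balls in (R^d, L^2)


def sup_norm_bound(d, k):
    return pivot_collection_bound(2 * d, k)          # boxes in (R^d, L^inf)


def hamming_bound(d, k):
    return pivot_collection_bound(d + math.log2(d), k)  # finite-class bound


print(round(euclidean_bound(20, 50)))  # the value quoted in the text: 49053
```

Writing the bound as a composition of the two lemmas makes it easy to substitute a different estimate for the VC dimension of balls in another space.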

3.2 Rates of convergence 

In addition to the convergence specified by the Law of Large Numbers, various
inequalities exist to describe the rate of this convergence. An example of that
is:

Theorem 3.9 (Hoeffding inequality). Suppose we have a sequence of i.i.d. random
variables $X_i \in [0, 1]$, $1 \le i \le n$, with sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
Then for all $\varepsilon > 0$,

$$P\left(|\bar{X} - E| > \varepsilon\right) \le 2 \exp(-2 n \varepsilon^2)$$

where $E$ is the expected value of any $X_i$.

This is more precise than the usual statement of the Law of Large Numbers:

$$P\left(|\bar{X} - E| > \varepsilon\right) \xrightarrow{n \to \infty} 0$$

We shall go back to Glivenko-Cantelli once more. In terms of $\mathcal{A}$ the theorem
tells us:

$$\text{for all } \varepsilon > 0, \quad P\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \varepsilon\right) \xrightarrow{n \to \infty} 0$$

With $\varepsilon > 0$ considered fixed, we can ask how fast the expression on the left goes
to zero. A consequence of the Kolmogorov-Smirnov Law [Vap98] is:

$$P\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \varepsilon\right) < 2 \exp(-2 n \varepsilon^2)$$

This is perhaps unsurprisingly very similar to the Hoeffding inequality. This
convergence is what we will call exponential or fast in keeping with [Vap98].
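To make the exponential rate concrete, the Hoeffding bound $2\exp(-2n\varepsilon^2)$ can be inverted to give the sample size needed for a deviation of at most $\varepsilon$ with probability at least $1 - \eta$. The sketch below is a direct transcription of that inequality; the function names and the example values $\varepsilon = 0.1$, $\eta = 0.05$ are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(n, eps):
    """Upper bound 2*exp(-2*n*eps^2) on P(|mean - E| > eps), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


def hoeffding_sample_size(eps, eta):
    """Smallest n for which the Hoeffding bound is at most eta.

    Solving 2*exp(-2*n*eps^2) <= eta gives n >= ln(2/eta) / (2*eps^2).
    """
    return math.ceil(math.log(2.0 / eta) / (2.0 * eps ** 2))


n_needed = hoeffding_sample_size(eps=0.1, eta=0.05)
print(n_needed, hoeffding_bound(n_needed, 0.1))
```

Note that the required $n$ grows only logarithmically in $1/\eta$ but quadratically in $1/\varepsilon$.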

The extension ([Vap98] p. 148) to the case of any $\mathcal{A}$ of finite VC dimension
$\Delta$ is as follows:

Theorem 3.10 (Generalization of Glivenko-Cantelli). For a collection $\mathcal{A}$ of
subsets of $\Omega$, of finite VC dimension $\Delta$, and any measure $\mu$ on $\Omega$, we have that
for any $\varepsilon > 0$,

$$P\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \varepsilon\right) < 4 \exp\left\{\left(\frac{\Delta(1 + \ln(2n/\Delta))}{n} - \left(\varepsilon - \frac{1}{n}\right)^2\right) n\right\}$$

The convergence is eventually like

$$\exp(-\varepsilon^2 n),$$

which is again a fast rate of convergence.

A somewhat different form is quoted in [Dev97]:

$$P\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \varepsilon\right) < 8 \exp\left(\Delta(1 + \ln(n/\Delta))\right) \exp\left(\frac{-n \varepsilon^2}{32}\right)$$

Except for the assumption that $\mathcal{A}$ is of finite VC dimension no other information
is used. In fact since no information about the measure $\mu$ is incorporated the
left side can be replaced by its supremum taken over all possible probability
measures on the underlying space $\Omega$.
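A bound of this shape can be inverted numerically to see how large $n$ must be before it says anything. The sketch below assumes the form $8\exp(\Delta(1 + \ln(n/\Delta)))\exp(-n\varepsilon^2/32)$ and works with its logarithm to avoid overflow. For $n \ge \Delta$ the logarithm increases up to $n = 32\Delta/\varepsilon^2$ and decreases afterwards, so the search starts there and uses doubling followed by bisection. Function names and the example parameters are assumptions of this illustration.

```python
import math


def log_vc_bound(n, delta, eps):
    """Natural log of 8 * exp(delta*(1 + ln(n/delta))) * exp(-n*eps^2/32)."""
    return math.log(8.0) + delta * (1.0 + math.log(n / delta)) - n * eps ** 2 / 32.0


def smallest_sample_size(delta, eps, eta):
    """Smallest n >= delta at which the bound drops to eta or below.

    Assumes delta >= 1, 0 < eps <= 1 and 0 < eta <= 1; under these
    assumptions the bound exceeds 1 for every n between delta and
    32*delta/eps^2 and is decreasing beyond that point.
    """
    if not (delta >= 1 and 0 < eps <= 1 and 0 < eta <= 1):
        raise ValueError("expected delta >= 1, 0 < eps <= 1, 0 < eta <= 1")
    target = math.log(eta)
    lo = math.ceil(32.0 * delta / eps ** 2)  # bound is still above target here
    hi = 2 * lo
    while log_vc_bound(hi, delta, eps) > target:
        lo, hi = hi, 2 * hi
    # Invariant: bound(lo) > target >= bound(hi); bisect on the decreasing part.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_vc_bound(mid, delta, eps) <= target:
            hi = mid
        else:
            lo = mid
    return hi


# Example: the Euclidean bound of Theorem 3.8 for d = 20, k = 50.
print(smallest_sample_size(delta=49053, eps=0.5, eta=0.5))
```

Even with generous $\varepsilon$ and $\eta$, the required $n$ is at least $32\Delta/\varepsilon^2$, that is, several orders of magnitude above the VC dimension itself.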

Depending on the specific case, tighter bounds may be possible, using other
capacity concepts than the VC dimension and a priori knowledge about the
measure $\mu$ [Vap98], [Men03].

As an example, we will go back to covering numbers. We can turn any $\mathcal{A}$
into a metric space by using the exclusive or under some measure. Using the
counting measure, an exclusive or is just:

$$A \oplus_n B := \mu_n\left((A \cup B) \setminus (A \cap B)\right)$$

As a result we can talk of the minimal $\varepsilon$-net $B_n(\mathcal{A}, \varepsilon)$ induced by $\mu_n$ or the
minimal $\varepsilon$-net $B(\mathcal{A}, \varepsilon)$ induced by $\mu$. If the VC dimension of $\mathcal{A}$ is known to be
finite, the following bound holds (cf. e.g. [Pes07b]):

$$P\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \varepsilon\right) \le 8\, \mathbb{E}_\mu\left[B_n(\mathcal{A}, \varepsilon/8)\right] \exp\left(\frac{-n \varepsilon^2}{128}\right)$$

Then perhaps it is unsurprising that there is a relationship between covering
numbers of $\mathcal{A}$ and its VC dimension (cf. e.g. [Pes07b]):

$$\log B_n(\mathcal{A}, \varepsilon) \le \Delta(\log n + 1 - \log \Delta)$$

To free ourselves from the particular choice of sample (which depends on the
unknown $\mu$), we can introduce a new concept.

Definition 3.11. The metric entropy of $\mathcal{A}$ is [Dev97] the function of $\varepsilon$ defined
as:

$$N(\mathcal{A}, \varepsilon) = \sup_n \sup_{\mu_n} B_n(\mathcal{A}, \varepsilon)$$

It then follows that

$$P\left(\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \varepsilon\right) \le 8 N(\mathcal{A}, \varepsilon/8) \exp\left(\frac{-n \varepsilon^2}{128}\right)$$

A natural restatement of these results is to ask how big $n$ has to be
for the expression on the left to be less than some $\eta > 0$. Solving for $n$ we get

$$n \ge \frac{128}{\varepsilon^2} \left(\log N(\mathcal{A}, \varepsilon/8) + \log \frac{8}{\eta}\right)$$

The use of some technical inequalities (cf. e.g. [Pes07b]) yields a similar result
in terms of VC dimension:

$$n \ge \frac{128}{\varepsilon^2} \left(\Delta \log \frac{2e^2}{\varepsilon} + \log \frac{8}{\eta}\right) \tag{3.4}$$
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Chapter 4 



Linking theoretical and 
empirical observations 

In this chapter we are able to tie everything together for a conclusion about
indexability. Essentially, we are now able to go from the theoretical observation
described by (2.4) to (2.3). Let us be clear that we are not proving that
for a given dataset all queries are linear for all pivot-based indexes. There are
several points to keep in mind:

• Convergence in (2.3) is in probability: it is conceivable that for some rare 
datasets indexing is easy. 

• The analysis is asymptotic, i.e. for large n and d, so we do not dispute 
that datasets of small dimension are indexable. 

• We assume certain properties of our underlying spaces $(\Omega_d)$: concentration
of measure has to take place, and the VC dimension of balls is small.
We think it is not unreasonable to assume these are commonly present,
but again they are probably not universal.

Let us walk through to the main result, on the way establishing a bound for $k$.
From the previous section, we know that only for large values of $n$ will empirical
measures be close (up to $\varepsilon$) to actual measures with high likelihood $(1 - \eta)$. The
lower bound on $n$ then naturally depends on $\varepsilon$, $\eta$ but also on the VC dimension
$\Delta$ of the collection of sets $\mathcal{A}_k$ we are trying to measure.

Let us fix $\varepsilon$ and $\eta$, e.g. for our purposes both can be 1/2. Then by pooling
all constants, including $\varepsilon$ and $\eta$ but not $\Delta$, we can rewrite expression (3.4) as:

$$n \ge M_1 \Delta \tag{4.1}$$

where $M_1 > 1$. What we would like to avoid is to have the right part of this
expression grow linearly in $n$. We know an upper bound on $\Delta$ depends on $k$ and
$d$ as established in Theorem 3.8. As our concern is for asymptotic behaviour we
will simplify this bound to $k d \ln k$. We will generalize to define (an) acceptable
asymptotic behaviour for the VC dimension of balls: linear in $d$.

Now we would like to be able to conclude that the database sizes we encounter
are large enough for Theorem 3.10 to hold (that is, the right side of (4.1)
is asymptotically below $n$). The number of pivots $k$ can potentially ruin the
day. There are indexing schemes, like AESA [Zez05] or the above-mentioned
Orchard's algorithm 1.1 where $k = n$. In such cases, what we have shown above
is that samples much bigger than $n$ are needed for Theorem 3.10 behaviour.
Alas (or fortunately?) our sample is all of $X$ and not more - so for large $k$ we
are not able to make the desired conclusions.

There are good arguments as to why such storage is not practical: in a database
consisting of an enormous number of records, is there really space for storing
an index whose size is the square of that very large number? (Google and
the Internet are a case in point.) It has even been argued that under certain
assumptions the optimal number of pivots is on the order of $\ln n$ and that even
this number is rarely reached in practice [Cha01]. Therefore it would not be
unrealistic to restrict the analysis to a number of pivots much smaller than $n$.
For example we could require

$$k = O(\log n)$$

We also have to keep in mind that the query algorithm we have described
requires at least $k$ distance computations, so if $k = n$ the query is linear in $n$,
which for our purposes is no better than a linear search.

A similar discussion applies to dimension $d$: although it is possible to consider
situations when the growth of $d$ is on the order of $n$, more reasonable are
situations in which it grows somewhat slower, as argued in section 1.3.

Combining $d = o(n)$ with the even stronger condition on $k$, we can conclude
that:

$$\Delta = o(n)$$

and hence asymptotically we know that the right side of expression (4.1) falls
(much) under $n$. Since the aim is only for $\Delta = o(n)$, the requirement on $k$ can
be weakened somewhat:

$$k = o\left(\frac{n}{d}\right)$$

This will be our standing assumption for $k$.
Hence we have a result of the form:

$$P\left(\sup_C |\mu_\#(C) - \mu(C)| > \varepsilon\right) < \eta \tag{4.2}$$

This is equivalent to saying that if (2.4) holds then so does (2.3).

As we have not made it precise in chapter 2 as to why (2.4) holds, we will
come back to the concentration of measure in the spaces $\Omega_d$. Restating (2.1) using
an exponential concentration of measure function and applying the reasoning of
Section 2.2, for some constants $M, c > 0$ independent of $d$:

$$\mu(C) \le M k\, e^{-c r^2 d}$$

We will sacrifice a certain number of $C$ so that $r$ can be considered a constant
(see section 2.3): we will proceed with at least half the queries having radius $r$
above a constant independent of $d$. Hence the quantities that vary in $d$ are $n$
and $k$. Since $d$ is superlogarithmic in $n$, for all sufficiently large $d$:

$$\forall c' > 0, \quad d > c' \log n$$
$$\Rightarrow \forall c' > 0, \quad \exp(-d) < \exp(-c' \log n)$$
$$\Rightarrow \forall c' > 0, \quad \exp(-d) < n^{-c'}$$

So $e^{-c r^2 d}$ decays faster than any power of $n$, and hence, since $k \le n$,

$$\mu(C) = o(1/n)$$

so not only does $\mu(C) \to 0$, we have a bound on the convergence as well. In fact
this holds for at least half the queries simultaneously so:

$$\operatorname{median}\, \sup_C \mu(C) = o(1/n)$$

This, combined with equation (4.2), gives us the main result.

4.1 Main result 

Theorem 4.1. Consider a sequence of metric spaces $(\Omega_d, \rho_d)$ where $d = 1, 2, 3, \ldots$
and the VC dimension of closed balls in space $(\Omega_d, \rho_d)$ as a function of $d$ is $O(d)$.
Furthermore the metric spaces come equipped with Borel probabilities $\mu_d$ such
that for fixed $C, c > 0$ the sequence of concentration functions of $(\Omega_d, \rho_d, \mu_d)$
satisfies

$$\forall \varepsilon > 0, \quad \alpha_d(\varepsilon) \le C e^{-c \varepsilon^2 d}$$

For each $d$ an i.i.d. sample $X_d$ of size $n_d$ is selected from $\Omega_d$, according to $\mu_d$.
The sample size $n_d$ is such that $d = \omega(\log n_d)$ and $d = n_d^{o(1)}$ if $d$ is expressed
as a function of $n_d$. We treat $X_d$ as a dataset on which to build an index for
similarity search. The index built is a pivot index using $k$ pivots, where

$$k = o(n_d / d).$$

We fix any small $\varepsilon, \eta > 0$ as desired. Suppose we only ask queries whose radius
is equal to or greater than the distance to the nearest neighbour of the query centre $q \in \Omega_d$
in $X_d$.

Then there exists a $D$ such that for all $d \ge D$, the probability that at
least half the queries on dataset $X_d$ take less than $(1 - \varepsilon) n_d$ time is
less than $\eta$.

Furthermore, if we allow the likelihood $\eta$ to depend on $d$, we can
pick $\eta_d$ so that the above holds true and

$$\lim_{D \to \infty} \prod_{d=D}^{\infty} (1 - \eta_d) = 1$$

We emphasize that this result is independent of the selection of pivots.



We have done all the calculations to show this theorem except demonstrate
the limit. We will nevertheless make a quick summary. First of all, we will refer
to Theorem 3.8 and Chapter 2 to provide us examples of spaces that meet all
of the above requirements. It must be said that these families of spaces are not
peculiar counterexamples but are commonly used in indexing today - except of
course that only the low dimensional ones are successfully being indexed.

We showed above how the results of Section 2.2 use the Concentration of
Measure to deduce the "theoretical version" of linear querying. We could only
show it for all range queries above some fixed $r > 0$, but this is not an obstacle
as in all spaces that we talk of distances are scaled appropriately. In particular
the expected distance to the nearest neighbour approaches a constant as $d \to \infty$.
So by adding this requirement of minimum radius we are merely disregarding
queries that return no elements except perhaps the query center itself. To link
this to querying in a discrete structure that is a dataset, we used the previous
chapter's results where the assumptions included that the VC dimension of balls
is small. The result then is that querying in these spaces is linear for all high
dimensions.

We ignored $k$ in calculating the runtime as it becomes asymptotically
insignificant with increasing dimension. A $k$ larger than that stipulated in the
main result will matter, but may trivially give linear runtime in $n_d$ (e.g. if $k$ is
linear in $n_d$). What remains unanswered here is the behaviour for $k$ in between
$o(n_d / d)$ and $\Theta(n_d)$.

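To derive the asymptotic probability, equation (3.4) can be used to obtain the
lower bound on $\eta$:

$$\eta \ge \exp\left(\Delta \log \frac{2e^2}{\varepsilon} + \log 8 - \frac{n \varepsilon^2}{128}\right)$$

which as an asymptotic function of $d$ transforms into

$$\eta = \exp\left(-d^{\omega(1)}\right)$$

Assuming independent choices of the datasets $X_d$, and assuming that for each
$d$ the probability of an event is at least $1 - \eta_d$, we will estimate the quantity

$$\prod_{d=D}^{\infty} (1 - \eta_d)$$

As $\eta_d$ goes to 0 at least as fast as $e^{-d}$ it is enough to show that

$$\lim_{D \to \infty} \prod_{d=D}^{\infty} (1 - e^{-d}) = 1$$

to have

$$\lim_{D \to \infty} \prod_{d=D}^{\infty} (1 - \eta_d) = 1$$

as well.

Observing [Ash71] that for any sequence $0 \le \eta_d < 1$,

$$1 - \sum_{d=1}^{N} \eta_d \le \prod_{d=1}^{N} (1 - \eta_d) \le \exp\left(\sum_{d=1}^{N} -\eta_d\right)$$

we can extend this, for any $D$, to:

$$1 - \sum_{d=D}^{\infty} \eta_d \le \prod_{d=D}^{\infty} (1 - \eta_d) \le \exp\left(\sum_{d=D}^{\infty} -\eta_d\right)$$

We know that as a geometric series,

$$\sum_{d=D}^{\infty} e^{-d} = \frac{e^{-D}}{1 - e^{-1}} \xrightarrow{D \to \infty} 0$$

Hence we can conclude that

$$\lim_{D \to \infty} \prod_{d=D}^{\infty} (1 - e^{-d}) = 1$$

□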
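The sandwich used in the proof can be checked numerically for $\eta_d = e^{-d}$. The sketch below truncates the infinite product after a fixed number of factors, which is an assumption of this example; the neglected factors differ from 1 by far less than floating-point precision. The function names are hypothetical.

```python
import math


def tail_product(D, terms=200):
    """Approximate prod_{d=D}^{infinity} (1 - e^{-d}) using `terms` factors."""
    prod = 1.0
    for d in range(D, D + terms):
        prod *= 1.0 - math.exp(-d)
    return prod


def tail_sum(D):
    """Closed form of the geometric series sum_{d=D}^{infinity} e^{-d}."""
    return math.exp(-D) / (1.0 - math.exp(-1.0))


for D in (1, 5, 10):
    lower = 1.0 - tail_sum(D)        # left-hand side of the sandwich
    upper = math.exp(-tail_sum(D))   # right-hand side of the sandwich
    print(D, lower, tail_product(D), upper)
```

As $D$ grows, both ends of the sandwich approach 1, which is the limit claimed in the theorem.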

The examples given above of the various spaces exhibiting normal concen- 
tration of measure should convince the reader that it is real and widespread, 
though of course not universal. For the VC dimension of balls to depend lin- 
early on d is a more vague requirement as the definition of dimension for metric 
spaces for the purposes of similarity search is an unresolved problem. It is how- 
ever clearly impossible to take out dimension from a discussion of the curse of 
dimensionality, which is after all what we have shown, caveats notwithstanding. 

This is not the first lower bound result for pivoting algorithms or indexes
in general. There has been research for some time into lower bounds for various
cases, though most often for approximate algorithms, e.g. [Chak]. A specific
lower bound for pivot-based indexing already mentioned above is that of
[Cha01b]:

$$d \log n$$

It is not asymptotic, and assumes that $k = \Theta(\log n)$. Furthermore the pivot
selection is assumed to be random, as opposed to our bound that is applicable
to any pivot selection technique.

If we are to apply our restrictions on d (assumed to be asymptotically equiv-
alent to d), the result is that

$$d \log n = \omega(\log^2 n)$$
$$d \log n = n^{o(1)}$$

So if we are to use these results asymptotically they do not provide strong lower
bounds on the cost of similarity search.
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Chapter 5 



Indexing experiments



We have borrowed basic code libraries from [SISAP] and followed up on the
techniques and datasets in [Bus03] to simulate variations of the incremental
pivot selection building algorithms. The aim of this section is to test-drive
different approaches to selecting pivots to demonstrate the challenges that arise
from the curse of dimensionality.

The construction of a pivot index is essentially a matter of finding the "right"
pivots. It is well known that a non-random selection of pivots can improve
similarity search. The most commonly used technique [Bus03] is called incremental
selection. The broad aim is to maximize the average distance $\rho_{p_1, \ldots, p_i}$ for each
$1 \le i \le k$. That is, we are doing a greedy search: at each step we find a pivot
that maximizes the average and add it to our pivot list.

To estimate the average value of $\rho_{p_1, \ldots, p_i}$, that is:

$$\mathbb{E}_{\mu \times \mu}\left(\rho_{p_1, \ldots, p_i}(\omega_1, \omega_2)\right)$$

we use a sample of $A$ pairs from $X$, and by averaging $\rho_{p_1, \ldots, p_i}(x, y)$ over all the
pairs we calculate an estimate. In practice then we are not maximizing distance
$\rho_{p_1, \ldots, p_i}$ but an $A$-sample version of it. Since our spaces have bounded diameter,
calculating sample size in principle is not complicated, e.g. using Theorem 3.9. The
curse of dimensionality interferes however by requiring the estimation to be ever
more precise for higher dimensions. Due to space and time limitations, a tradeoff
between large values of $A$ and the other sample - that of pivots - must be found
experimentally.

The sample of pivots is made at each step - $N$ random choices for the next
candidate pivot are made, and the one that maximizes the ($A$-sample) distance
gets added. Once we reach $k$ pivots, we calculate the distances from $X$ to the
set of pivots. That distance matrix, together with the pivot list, constitutes the
index which can be used for similarity search as already described in Algorithm
2. This construction procedure is summarized in Algorithm 3.

Data: dataset, k
Result: pivot-list, distance-matrix
Sample A pairs from dataset;
pivot-list: {} ;
for i in 1 to k do
    Select N candidates for a pivot;
    Find pivot maximizing distance based on pairs;
    Add selected pivot to pivot-list;
end
calculate distance-matrix;
return pivot-list, distance-matrix;

Algorithm 3: constructing a pivot-based index

A single experiment consists of constructing this index for a given dataset,
and then running thousands of queries to calculate the average number of
distance computations required to answer a query. As queries with larger radii are
very slow, the experimentation, as in [Bus03], is restricted to queries that return
on the order of 10 to 100 elements out of a typical dataset size of 100,000.
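Algorithm 3 and the cost measure used in the experiments can be sketched in Python as follows. This is not the code used for the experiments, only a minimal illustration under stated assumptions: the pivot distance is taken to be the usual lower bound $\max_i |\rho(x, p_i) - \rho(y, p_i)|$, the metric is Euclidean, and the query routine is the standard pivot filter (assumed here to correspond to Algorithm 2). All function and parameter names are hypothetical.

```python
import math
import random


def build_pivot_index(dataset, k, num_pairs, num_candidates, dist=math.dist, rng=None):
    """Greedy incremental pivot selection, then the point-to-pivot matrix."""
    rng = rng or random.Random()
    # Sample A pairs from the dataset (A = num_pairs).
    pairs = [tuple(rng.sample(dataset, 2)) for _ in range(num_pairs)]
    current = [0.0] * num_pairs  # pivot distance of each pair under chosen pivots
    pivots = []
    for _ in range(k):
        best_mean, best_gaps, best_pivot = -1.0, None, None
        # Select N candidates (N = num_candidates) and keep the best one.
        for cand in rng.sample(dataset, num_candidates):
            gaps = [max(cur, abs(dist(x, cand) - dist(y, cand)))
                    for cur, (x, y) in zip(current, pairs)]
            mean = sum(gaps) / num_pairs
            if mean > best_mean:
                best_mean, best_gaps, best_pivot = mean, gaps, cand
        pivots.append(best_pivot)
        current = best_gaps
    matrix = [[dist(x, p) for p in pivots] for x in dataset]
    return pivots, matrix


def range_query(dataset, pivots, matrix, q, r, dist=math.dist):
    """Range query with pivot filtering; also returns the distance count."""
    q_to_pivots = [dist(q, p) for p in pivots]
    computations = len(pivots)
    result = []
    for x, row in zip(dataset, matrix):
        # Triangle inequality: |d(q,p) - d(x,p)| > r implies d(q,x) > r.
        if max(abs(a - b) for a, b in zip(q_to_pivots, row)) > r:
            continue
        computations += 1
        if dist(q, x) <= r:
            result.append(x)
    return result, computations


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(200)]
pivots, matrix = build_pivot_index(data, k=4, num_pairs=50, num_candidates=10, rng=rng)
found, cost = range_query(data, pivots, matrix, q=(0.5, 0.5), r=0.1)
print(len(found), "points found using", cost, "distance computations")
```

The returned count is the quantity plotted in the experiments: $k$ computations for the query-to-pivot distances plus one for every point that the pivots fail to discard.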

We have expanded on incremental pivot selection procedure by selecting 
pairs of points from X that lie close to each other. The underlying idea is 
to concentrate on points that are hard to separate because they are so close 
together. 

We will not explicitly analyze the costs of doing a random sample versus this 
"smart" sample since it is clearly very expensive to arrive at the smart sample: 
if no such list exists a priori one would need to compile it by making an index 
(which incidentally was our original goal) and then making small radius range 
or nearest neighbour queries. Situations are however conceivable where such a 
list can be obtained at low cost. 

In Figure 5.1 we have plotted results for a dataset coming from compressed
image data from NASA [DIMACS]. This dataset has 40000 20-dimensional vectors,
equipped with the Euclidean metric. Incidentally its estimated Chávez
intrinsic dimension is 3.9. This dataset presented certain challenges as results
lacked the stability of other datasets, e.g. for pivot size 20 the difference between
worst and best result for the same parameters can be as high as 14 distance
computations. This uncertainty remains even after varying the pivot sample size $N$.
In our experiments we leave $N$ fixed at 40 since varying it changes neither the
results nor their variance. On the other hand this difference of 14 represents
only about 4% so it doesn't affect the broad conclusions. The experiment was
run using different sample sizes for the incremental selection technique: 5000,
10000, and 50000. Testing was done by averaging 5000 queries that return an
average of 40 points (that is 0.1% of the dataset). The overall result is that
indexes based on the different samples all achieve about the same performance,
with the optimal number of pivots around 60. The conclusion is that significant
savings in construction time can be achieved by reducing sample size. Query
time and index size however remain the same for all 3 cases.

Figure 5.2 illustrates what happens when we select "smart" samples instead
of random $A$ pairs as before. All sample sizes are 5000, with testing done by
averaging 5000 queries. We produce two sets of smart samples: the first is based on running
20-nearest-neighbour (20NN) queries, where 5000 random queries are made and
their 20th nearest neighbour is selected as the second element of the pair, thus
producing 5000 pairs. The second sample is produced similarly using 100NN
queries. Here we notice that the choice of which smart sample to select matters
somewhat, as the 20NN leads to a slightly better performing index. The difference
is however small and must be balanced with the expense of construction,
for as we mentioned before we need to build a preliminary index to create the
smart sample.

Figure 5.1: Distance computations vs number of pivots for the NASA dataset

Figure 5.2: Distance computations vs number of pivots for the NASA dataset

Figure 5.3: Distance computations vs number of pivots for an 8-dimensional uniform
cube dataset

We repeat the smart sample experiments on a synthetic dataset: the uniform
cube of dimension 8. The results here show even less difference between random
and smart samples of $A$ pairs. In addition variation is very low, with repeat
experiments giving virtually identical results. Perhaps more interesting is to do
simulations where the dimension $d$ rises. Unfortunately hardware and software
limitations severely limit the number $d$, with our setup running out of memory
at $d = 14$. We can point however to results by [Bus03] where performance for
the optimal number of pivots is plotted against various values of $d$ from 2 to 14,
for uniformly distributed cubes. The results clearly demonstrate exponential
growth in $d$ no matter the pivot selection technique. In other words, so far, no
matter how contrived the building algorithm for pivot-based indexing, we have
yet to find one that breaks the curse of dimensionality. Our proposal to vary
the sample selection for the incremental technique performs just like the rest in
that respect.

In summary, it is impossible to prove the curse of dimensionality by simulations
alone, as we could always ask ourselves whether another index building
algorithm exists. This is why our theoretical result is so powerful: we are able
to make the conclusion for any pivot-index building method. We thus leave
several options for the practitioner: to try better indexes, control intrinsic
dimension, or limit the size of the dataset.

The solution so far seems to have been simply to avoid high-dimensional
datasets for the most part. Barring some breakthrough in indexing technology, the advice
for those dealing with truly high-dimensional data is to limit the dataset size in
relation to available processing power: this means no exponential growth in size
is to be allowed.
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